University of Liège – University of Michigan

Model-free Monte Carlo-like Policy Evaluation

Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst

March 30th, 2010, Heeze, The Netherlands


Outline

● Introduction
● Problem statement
● The Monte Carlo estimator
● The Model-Free Monte Carlo estimator
● Illustration
● Conclusions and future work

"Model-Free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst (2010). In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, JMLR W&CP 9, pp. 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.

Introduction

● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...).
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy.
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach.
● In this paper, the only information available is contained in a sample of one-step transitions of the system.
● In this context, we propose a "Model-Free Monte Carlo (MFMC) estimator" of the performance of a given policy that mimics the Monte Carlo estimator.

Problem statement

● We consider a discrete-time system whose dynamics over T stages is given by $x_{t+1} = f(x_t, u_t, w_t)$.
● All $x_t$ lie in a normed state space X, all $u_t$ lie in a normed action space U, and the disturbances $w_t$ are i.i.d. according to a probability distribution $p_W(\cdot)$.
● An instantaneous reward $r_t = \rho(x_t, u_t, w_t)$ is associated with taking the action $u_t$ while being in state $x_t$.
● A policy $h: \{0, \ldots, T-1\} \times X \to U$ is given, and we want to evaluate its performance.

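Problem statement

● The expected return of the policy h when starting from an initial state $x_0 = x$ is given by

    $J^h(x_0) = \mathbb{E}_{w_0, \ldots, w_{T-1} \sim p_W(\cdot)} \left[ R^h(x_0) \right]$

  where

    $R^h(x_0) = \sum_{t=0}^{T-1} r_t$, with $x_{t+1} = f(x_t, h(t, x_t), w_t)$ and $r_t = \rho(x_t, h(t, x_t), w_t)$.

[Figure: the trajectory $x_0, x_1, \ldots, x_T$ generated under the disturbances $w_0, \ldots, w_{T-1}$, with rewards $r_0, \ldots, r_{T-1}$.]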
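The return of a single trajectory can be sketched in Python as follows. This is only an illustration: the toy functions `f`, `rho` and `h` below are hypothetical stand-ins and are not the system used in the talk.

```python
def rollout_return(f, rho, h, x0, disturbances):
    """Return R^h(x0) = sum_t r_t for one fixed disturbance sequence w_0..w_{T-1}."""
    x, total = x0, 0.0
    for t, w in enumerate(disturbances):
        u = h(t, x)            # action chosen by the policy at stage t
        total += rho(x, u, w)  # instantaneous reward r_t
        x = f(x, u, w)         # next state x_{t+1}
    return total


# Hypothetical toy system (an assumption, for illustration only).
def f(x, u, w):
    return x + u + w


def rho(x, u, w):
    return -abs(x)


def h(t, x):
    return -x


ret = rollout_return(f, rho, h, x0=1.0, disturbances=[0.5, 0.25, 0.0])
print(ret)
```

The expected return $J^h(x_0)$ is the expectation of this quantity over the disturbance sequence.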

Problem statement

● Problem: the functions f, ρ and $p_W(\cdot)$ are unknown.
● They are replaced by a sample of n one-step system transitions $(x^l, u^l, r^l, y^l)$, $l = 1, \ldots, n$.
● How can $J^h(x_0)$ be evaluated in this context?

The Monte Carlo estimator

● We define the Monte Carlo estimator of the expected return of h when starting from the initial state $x_0$:

    $M^h_p(x_0) = \frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r_t^i$

  with, for $i = 1, \ldots, p$ and $t = 0, \ldots, T-1$,

    $x_0^i = x_0$, $x_{t+1}^i = f(x_t^i, h(t, x_t^i), w_t^i)$, $r_t^i = \rho(x_t^i, h(t, x_t^i), w_t^i)$, $w_t^i \sim p_W(\cdot)$.

[Figure: p trajectories simulated from $x_0$, the i-th one under the disturbances $w_0^i, \ldots, w_{T-1}^i$ with return $\sum_{t=0}^{T-1} r_t^i$; the MC estimator is the average of the p returns.]

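The Monte Carlo estimator

● We assume that the random variable $R^h(x_0)$ admits a finite variance $\sigma^2_{R^h}(x_0)$.
● Denoting the Monte Carlo estimator with p trajectories by $M^h_p(x_0)$, its bias and variance are

    $\mathbb{E}\left[ M^h_p(x_0) - J^h(x_0) \right] = 0$, $\mathrm{Var}\left[ M^h_p(x_0) \right] = \frac{\sigma^2_{R^h}(x_0)}{p}$.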
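A minimal Python sketch of the Monte Carlo estimator described above: it simulates p independent trajectories under the policy and averages their returns. The toy system and the uniform disturbance on [-1, 1] are hypothetical, chosen only because the expected return is then known in closed form (here $J^h(1) = -5/3$ for T = 3).

```python
import random


def mc_estimate(f, rho, h, sample_w, x0, T, p):
    """Average the returns of p independently simulated trajectories."""
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            w = sample_w()         # w_t^i drawn from p_W(.)
            u = h(t, x)            # u_t^i = h(t, x_t^i)
            total += rho(x, u, w)  # r_t^i
            x = f(x, u, w)         # x_{t+1}^i
    return total / p


# Hypothetical toy system: the policy cancels the state, so x_{t+1} = w_t.
def f(x, u, w):
    return x + u + w


def rho(x, u, w):
    return -(x * x)


def h(t, x):
    return -x


rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = mc_estimate(f, rho, h, lambda: rng.uniform(-1.0, 1.0),
                       x0=1.0, T=3, p=20000)
print(estimate)  # should be close to -5/3 for this toy system
```

This estimator needs access to the system (or to f, ρ and $p_W(\cdot)$) in order to simulate it.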

The Model-free Monte Carlo estimator

● Here, the MC approach is not feasible, since the system is unknown.
● The only information available on the system is gathered in a sample of n one-step transitions $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$.
● We define the random variable $\tilde{F}_n$ as follows: the set of pairs $P_n = \{(x^l, u^l)\}_{l=1}^{n}$ is arbitrarily chosen, whereas the pairs $(r^l, y^l)$ are determined by $(\rho(x^l, u^l, \cdot), f(x^l, u^l, \cdot))$ drawn according to $p_W(\cdot)$.
● $F_n$ is a realization of the random set $\tilde{F}_n$.
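To make the sampling model concrete, here is a small Python sketch of how such a sample of one-step transitions could be generated for a known toy system. The functions `f` and `rho`, the grid of state-action pairs and the disturbance distribution are all assumptions made for illustration; in the setting of the talk only the resulting sample is available.

```python
import random


def generate_transitions(f, rho, pairs, sample_w):
    """Build one realization of the sample: a tuple (x, u, r, y) per chosen pair."""
    sample = []
    for x, u in pairs:
        w = sample_w()  # a single disturbance drives both the reward and the next state
        sample.append((x, u, rho(x, u, w), f(x, u, w)))
    return sample


# Hypothetical toy system.
def f(x, u, w):
    return x + u + w


def rho(x, u, w):
    return -(x * x)


rng = random.Random(0)
# Arbitrarily chosen state-action pairs: a uniform 5 x 5 grid on [-1, 1]^2.
pairs = [(i / 2, j / 2) for i in range(-2, 3) for j in range(-2, 3)]
F_n = generate_transitions(f, rho, pairs, lambda: rng.uniform(-0.1, 0.1))
print(len(F_n), F_n[0])
```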

The Model-free Monte Carlo estimator

● We introduce the Model-Free Monte Carlo (MFMC) estimator.
● From the sample of transitions, we build p sequences of transitions of length T, called "broken trajectories".
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h.
● We average the cumulative returns over the p broken trajectories to compute an estimate of the expected return of h.
● The algorithm has complexity O(npT).
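The construction of the broken trajectories can be sketched in Python as below. This is a hedged reading of the steps above, not the authors' reference implementation: it assumes that each transition is used at most once (so that n ≥ pT is required), that ties are broken by lowest index, and that ∆ is the sum of absolute differences on scalar states and actions. The function names and the toy sample are hypothetical.

```python
def mfmc_estimate(sample, h, x0, T, p, dist):
    """Average the returns of p broken trajectories rebuilt from one-step transitions.

    sample: list of tuples (x, u, r, y); dist: distance between two (x, u) pairs.
    """
    n = len(sample)
    if n < p * T:
        raise ValueError("need at least p*T transitions")
    unused = list(range(n))  # indices of transitions not yet used (assumption)
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = h(t, x)
            # O(n) scan for the unused transition closest to (x, u);
            # min() returns the first minimizer, i.e. the lowest index on ties.
            l = min(unused, key=lambda j: dist(sample[j][:2], (x, u)))
            unused.remove(l)
            _, _, r, y = sample[l]
            total += r
            x = y  # continue from the end state of the chosen transition
    return total / p


def delta(a, b):
    """Assumed distance: |x - x'| + |u - u'| for scalar x and u."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Hypothetical deterministic toy sample: x' = x / 2 + u, r = x, policy u = 0.
base = [(1.0, 0.0, 1.0, 0.5), (0.5, 0.0, 0.5, 0.25),
        (0.25, 0.0, 0.25, 0.125), (3.0, 1.0, 3.0, 2.5)]
sample = base * 2  # two copies, so two broken trajectories can be built
estimate = mfmc_estimate(sample, lambda t, x: 0.0, x0=1.0, T=3, p=2, dist=delta)
print(estimate)
```

Each of the pT steps performs a linear scan over at most n transitions, which matches the O(npT) complexity stated above.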


The Model-free Monte Carlo estimator

[Figure: construction of the p broken trajectories starting from $x_0$. The i-th broken trajectory chains T transitions $(x^{l_t^i}, u^{l_t^i}, r^{l_t^i}, y^{l_t^i})$, $t = 0, \ldots, T-1$, each generated under its own disturbance $w^{l_t^i}$, and is shown next to the real trajectory under the disturbances $w^{l_0^i}, \ldots, w^{l_{T-1}^i}$. MFMC estimator: $\frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r^{l_t^i}$.]

The Model-free Monte Carlo estimator

● We assume that the functions f, ρ and h are Lipschitz continuous.

The Model-free Monte Carlo estimator

● Distance metric ∆
● k-sparsity
● $\Delta_k^{P_n}(x, u)$ denotes the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample $P_n$.

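Illustration of the k-sparsity

● The k-sparsity can be seen as the smallest radius such that all ∆-balls in X × U contain at least k elements from $P_n$.

[Figure: a point (x, u) in the X × U plane (axes X and U) with the distances $\Delta_1^{P_n}(x, u)$, ..., $\Delta_{k-1}^{P_n}(x, u)$, $\Delta_k^{P_n}(x, u)$ to its nearest neighbors in $P_n$, and a second point (x', u').]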
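As a rough numerical illustration, the k-sparsity can be approximated by taking the largest k-th nearest neighbor distance over a finite set of query points. This is only a sketch: the true quantity is a supremum over all of X × U, so a finite grid gives a lower bound, and the distance used here (sum of absolute differences on scalar states and actions) is an assumption.

```python
def kth_nn_distance(pairs, query, k, dist):
    """Distance from `query` to its k-th nearest neighbor among `pairs`."""
    return sorted(dist(pt, query) for pt in pairs)[k - 1]


def approx_k_sparsity(pairs, queries, k, dist):
    """Largest k-th nearest neighbor distance over a finite set of query points."""
    return max(kth_nn_distance(pairs, q, k, dist) for q in queries)


def delta(a, b):
    """Assumed distance: |x - x'| + |u - u'| for scalar x and u."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Hypothetical sample: the four corners of the unit square [0, 1] x [0, 1].
pairs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Query points: a 5 x 5 grid over the same square.
queries = [(i / 4, j / 4) for i in range(5) for j in range(5)]

alpha_1 = approx_k_sparsity(pairs, queries, k=1, dist=delta)
alpha_4 = approx_k_sparsity(pairs, queries, k=4, dist=delta)
print(alpha_1, alpha_4)
```

A denser sample shrinks these values, which is the sense in which the sample "covers" X × U.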

The Model-free Monte Carlo estimator

● Bias of the MFMC estimator: Theorem

The Model-free Monte Carlo estimator

● Variance of the MFMC estimator: Theorem

Illustration

● System
● $p_W(\cdot)$ is uniform over W, T = 15, $x_0 = -0.5$.

Illustration

● Simulations for p = 10, n = 100 ... 10 000, uniform grid

[Figure panel: Monte Carlo estimator with p trajectories]

Illustration

● Simulations for p = 1 ... 100, n = 10 000, uniform grid

[Figure panel: Monte Carlo estimator]

Conclusions and future work

● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting, the MFMC estimator.
● We have provided bounds on the bias and variance of the MFMC estimator.
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator.
● The MFMC estimator could be used in a direct policy search framework.
● Possible extensions (conditional probability distributions, parameter estimation, etc.).
