University of Liège – University of Michigan

Model-free Monte Carlo-like Policy Evaluation

Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst

May 19th, 2010

CAp 2010, Clermont-Ferrand, France

[Figure: motivating example. Patients 1, ..., n are followed over time 0, 1, ..., T; a therapy is to be evaluated.]

Outline

● Introduction
● Problem statement
● The Monte Carlo estimator
● The Model-Free Monte Carlo estimator
● MFMC estimator: analysis
● Illustration
● Conclusions and future work

Introduction

● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...).
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy.
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach.
● In this paper, the only information is contained in a sample of one-step transitions of the system.
● In this context, we propose a Model-Free Monte Carlo (MFMC) estimator of the performance of a given policy that mimics in some way the Monte Carlo estimator.

Problem statement

● We consider a discrete-time system whose dynamics over T stages is given by

$$x_{t+1} = f(x_t, u_t, w_t), \qquad t = 0, 1, \ldots, T-1.$$

● All $x_t$ lie in a normed state space $X$, all $u_t$ lie in a normed action space $U$, and the disturbances $w_t$ are i.i.d. according to a probability distribution $p_W(\cdot)$.

● An instantaneous reward $r_t = \rho(x_t, u_t, w_t)$ is associated with the action $u_t$ taken while being in state $x_t$.

● A policy $h : \{0, \ldots, T-1\} \times X \to U$ is given, and we want to evaluate its performance.

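Problem statement

● The expected return of the policy h when starting from an initial state $x_0$ is given by

$$J^h(x_0) = \mathbb{E}_{w_0, \ldots, w_{T-1} \sim p_W(\cdot)}\left[ R^h(x_0) \right]$$

where

$$R^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t)$$

with

$$x_{t+1} = f(x_t, h(t, x_t), w_t).$$

[Figure: a trajectory $x_0, x_1, \ldots, x_T$ driven by the disturbances $w_0, \ldots, w_{T-1}$ and collecting the rewards $r_0, \ldots, r_{T-1}$.]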

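Problem statement

● Problem: the functions f, ρ and $p_W(\cdot)$ are unknown.

● They are replaced by a sample of system transitions

$$\mathcal{F}_n = \left\{ (x^l, u^l, r^l, y^l) \right\}_{l=1}^{n}$$

where the pairs $(x^l, u^l)$ are arbitrarily chosen and the pairs $(r^l, y^l)$ are determined by $\left( \rho(x^l, u^l, w^l), f(x^l, u^l, w^l) \right)$, where $w^l$ is drawn according to $p_W(\cdot)$.

● How to evaluate $J^h(x_0)$ in this context?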

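The Monte Carlo estimator

● We define the Monte Carlo estimator of the expected return of h when starting from the initial state $x_0$:

$$M^h_p(x_0) = \frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r_t^i$$

with

$$r_t^i = \rho(x_t^i, h(t, x_t^i), w_t^i), \qquad x_{t+1}^i = f(x_t^i, h(t, x_t^i), w_t^i), \qquad x_0^i = x_0, \qquad w_t^i \sim p_W(\cdot).$$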
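The Monte Carlo estimator can be sketched in a few lines of Python. This is a minimal sketch that assumes a simulator of the system is available; the names `mc_estimate`, `f`, `rho`, `h` and `sample_w`, as well as the scalar toy system, are hypothetical and are not the system used in the talk.

```python
import random

def mc_estimate(f, rho, h, sample_w, x0, T, p, rng):
    """Average the T-stage return of policy h over p simulated trajectories."""
    total = 0.0
    for _ in range(p):
        x = x0                      # every trajectory starts from x0
        for t in range(T):
            u = h(t, x)             # action chosen by the policy
            w = sample_w(rng)       # disturbance drawn from p_W
            total += rho(x, u, w)   # instantaneous reward r_t
            x = f(x, u, w)          # next state x_{t+1}
    return total / p

# Hypothetical toy system (an assumption made for illustration only).
def f(x, u, w):
    return 0.5 * x + u + w

def rho(x, u, w):
    return -(x * x)

def h(t, x):
    return -0.5 * x

def sample_w(rng):
    return rng.uniform(-0.1, 0.1)

estimate = mc_estimate(f, rho, h, sample_w, x0=1.0, T=3, p=1000,
                       rng=random.Random(0))
```

For this toy system the policy cancels the state, so the return of one trajectory is -(1 + w_0^2 + w_1^2) and the expected return is -(1 + 2 · 0.01/3); the estimate should be close to that value.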

The Monte Carlo estimator

[Figure: p trajectories simulated from $x_0$ under the policy h, with disturbances $w_t^i \sim p_W(\cdot)$. Trajectory i yields the return $\sum_{t=0}^{T-1} r_t^i$; the MC estimator is the average $\frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r_t^i$.]

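The Monte Carlo estimator

● We assume that the random variable $R^h(x_0)$ admits a finite variance

$$\sigma^2_{R^h}(x_0) = \operatorname{Var}_{w_0, \ldots, w_{T-1} \sim p_W(\cdot)}\left[ R^h(x_0) \right].$$

● The bias and variance of the Monte Carlo estimator $M^h_p(x_0)$, built from p trajectories, are

$$\mathbb{E}\left[ M^h_p(x_0) - J^h(x_0) \right] = 0, \qquad \operatorname{Var}\left[ M^h_p(x_0) \right] = \frac{\sigma^2_{R^h}(x_0)}{p}.$$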
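Both quantities follow from the fact that the estimator averages p independent, identically distributed copies of the return. A short derivation, writing $M$ for the Monte Carlo estimator, $R^i$ for the return of the i-th simulated trajectory and $\sigma^2$ for the variance of the return (notation introduced here for convenience):

```latex
\begin{align*}
\mathbb{E}[M] &= \frac{1}{p} \sum_{i=1}^{p} \mathbb{E}\left[R^i\right]
              = \frac{1}{p} \cdot p \, J^h(x_0) = J^h(x_0), \\
\operatorname{Var}[M] &= \frac{1}{p^2} \sum_{i=1}^{p} \operatorname{Var}\left[R^i\right]
              = \frac{1}{p^2} \cdot p \, \sigma^2 = \frac{\sigma^2}{p}.
\end{align*}
```

The second line uses the independence of the p trajectories, which holds because each one is simulated with fresh disturbances.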

The Model-free Monte Carlo estimator

● Here, the MC approach is not feasible, since the system is unknown.
● We introduce the Model-Free Monte Carlo estimator.
● From the sample of transitions, we build p sequences of different transitions of length T called "broken trajectories".
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h.
● We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h.
● The algorithm has complexity O(npT).
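The construction of the broken trajectories can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the exact procedure of the talk: transitions are stored as tuples `(x, u, r, y)` with scalar states and actions, each step greedily picks the unused transition closest to the current state-action pair, ties go to the lowest index, and `dist` is a sum of absolute differences. The names `mfmc_estimate` and `dist` are hypothetical.

```python
def dist(a, b):
    """Assumed distance between state-action pairs (sum of absolute differences)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mfmc_estimate(transitions, h, x0, T, p, dist):
    """Estimate the expected return of h from one-step transitions (x, u, r, y)."""
    if p * T > len(transitions):
        raise ValueError("need at least p*T transitions")
    used = [False] * len(transitions)    # each transition is used at most once
    total = 0.0
    for _ in range(p):                   # build p broken trajectories
        x = x0
        for t in range(T):
            u = h(t, x)
            best, best_d = None, float("inf")
            for l, (xl, ul, _, _) in enumerate(transitions):   # O(n) scan
                if used[l]:
                    continue
                d = dist((x, u), (xl, ul))
                if d < best_d:           # strict test: ties go to the lowest index
                    best, best_d = l, d
            used[best] = True
            _, _, r, y = transitions[best]
            total += r                   # accumulate the reward of the chosen transition
            x = y                        # continue from its successor state
    return total / p

# Tiny hand-made sample (hypothetical data) and a policy that always plays u = 0.
sample = [(0.0, 0.0, 1.0, 1.0), (1.0, 0.0, 2.0, 5.0), (0.1, 0.0, 10.0, 0.0)]
one_long = mfmc_estimate(sample, lambda t, x: 0.0, x0=0.0, T=2, p=1, dist=dist)
two_short = mfmc_estimate(sample, lambda t, x: 0.0, x0=0.0, T=1, p=2, dist=dist)
```

Each of the pT steps scans the n transitions once, which matches the O(npT) complexity stated above. In the example, `one_long` chains the first two transitions (return 1 + 2), while `two_short` uses the first and the third transition for its two one-step trajectories (average of 1 and 10).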

The Model-free Monte Carlo estimator: how does it work?

Example with T = 3, p = 2, n = 8

[Figure sequence: step-by-step construction of the broken trajectories from the sample of transitions.]


MFMC estimator: analysis

● Assumption: the functions f, ρ and h are Lipschitz continuous.
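Written out, a Lipschitz assumption of this kind could read as follows. The constants $L_f$, $L_\rho$, $L_h$ and the exact form of the right-hand sides are assumptions introduced here for illustration; $\|\cdot\|_X$ and $\|\cdot\|_U$ denote the norms of the state and action spaces.

```latex
\begin{align*}
\| f(x, u, w) - f(x', u', w) \|_X &\le L_f \left( \| x - x' \|_X + \| u - u' \|_U \right), \\
| \rho(x, u, w) - \rho(x', u', w) | &\le L_\rho \left( \| x - x' \|_X + \| u - u' \|_U \right), \\
\| h(t, x) - h(t, x') \|_U &\le L_h \, \| x - x' \|_X,
\end{align*}
```

for all $x, x' \in X$, $u, u' \in U$, all disturbances $w$ and all $t \in \{0, \ldots, T-1\}$. Conditions of this form ensure that a transition observed near $(x, u)$ is a good proxy for what the system would do at $(x, u)$.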

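MFMC estimator: analysis

● The only information available on the system is gathered in a sample of n one-step transitions

$$\mathcal{F}_n = \left\{ (x^l, u^l, r^l, y^l) \right\}_{l=1}^{n}.$$

● We define the random set $\tilde{\mathcal{F}}_n$ as follows: the set of pairs $\mathcal{P}_n = \left\{ (x^l, u^l) \right\}_{l=1}^{n}$ is arbitrarily chosen, whereas the pairs $(r^l, y^l)$ are determined by $\left( \rho(x^l, u^l, w^l), f(x^l, u^l, w^l) \right)$, where $w^l$ is drawn according to $p_W(\cdot)$.

● $\mathcal{F}_n$ is a realization of the random set $\tilde{\mathcal{F}}_n$.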

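MFMC estimator: analysis

● Distance metric ∆

$$\Delta\left( (x, u), (x', u') \right) = \| x - x' \|_X + \| u - u' \|_U$$

● k-sparsity

$$\alpha_k(\mathcal{P}_n) = \sup_{(x, u) \in X \times U} \Delta_k^{\mathcal{P}_n}(x, u)$$

where $\Delta_k^{\mathcal{P}_n}(x, u)$ denotes the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample $\mathcal{P}_n = \left\{ (x^l, u^l) \right\}_{l=1}^{n}$ of state-action pairs.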

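MFMC estimator: analysis

The k-sparsity can be seen as the smallest radius such that all ∆-balls in X × U contain at least k elements from $\mathcal{P}_n$, the set of state-action pairs of the sample.

[Figure: the space X × U, with a point (x, u) and a neighboring sample point (x', u').]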
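The k-sparsity can be approximated numerically by replacing the supremum over X × U with a maximum over a finite set of query points, which gives a lower bound. The sketch below assumes scalar states and actions and a sum-of-absolute-differences distance; the names `kth_nn_distance`, `k_sparsity_on_queries` and `dist` are hypothetical.

```python
def dist(a, b):
    """Assumed distance between state-action pairs (sum of absolute differences)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def kth_nn_distance(point, pairs, k, dist):
    """Distance from `point` to its k-th nearest neighbor among `pairs`."""
    return sorted(dist(point, q) for q in pairs)[k - 1]

def k_sparsity_on_queries(pairs, k, queries, dist):
    """Lower bound on the k-sparsity: max over a finite set of query points."""
    return max(kth_nn_distance(q, pairs, k, dist) for q in queries)

# Two sample pairs and three query points on the segment between them.
pairs = [(0.0, 0.0), (1.0, 0.0)]
queries = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
alpha_1 = k_sparsity_on_queries(pairs, 1, queries, dist)
alpha_2 = k_sparsity_on_queries(pairs, 2, queries, dist)
```

Here the worst query for k = 1 is the midpoint (0.5, 0), at distance 0.5 from both sample pairs; for k = 2 the worst queries are the sample pairs themselves, each at distance 1 from the other. A finer query grid tightens the approximation of the supremum.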

MFMC estimator: analysis

● Expected value of the MFMC estimator

Theorem [bound on the bias of the MFMC estimator; the statement appears only as an image on the slide]

MFMC estimator: analysis

● Variance of the MFMC estimator

Theorem [bound on the variance of the MFMC estimator; the statement appears only as an image on the slide]

Illustration

● System [the dynamics and reward function appear only as an image on the slide]

● $p_W(\cdot)$ is uniform over W, T = 15, $x_0 = -0.5$.

Illustration

● Simulations for p = 10, n = 100 … 10 000, uniform grid

[Figure: two panels, "Model-free Monte Carlo estimator" (horizontal axis: n) and "Monte Carlo estimator".]

Illustration

● Simulations for p = 1 … 100, n = 10 000, uniform grid

[Figure: two panels, "Model-free Monte Carlo estimator" (horizontal axis: p) and "Monte Carlo estimator" (horizontal axis: p).]

Conclusions and future work

Conclusions

● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting, the MFMC estimator.
● We have provided bounds on the bias and variance of the MFMC estimator.
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator.

Future work

● MFMC estimator in a direct policy search framework.
● One could extend this approach to evaluate return distributions (and not only expected values). This could allow the development of "safe" policy search techniques based on Value at Risk (VaR) criteria.

Thank you
