University of Liège – University of Michigan
Model-free Monte Carlo-like Policy Evaluation
Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
May 19th, 2010
CAp 2010, Clermont-Ferrand, France
[Figure: evaluation of a therapy on patients 1, ..., n over time 0, 1, ..., T]
Outline
● Introduction
● Problem statement
● The Monte Carlo estimator
● The Model-Free Monte Carlo estimator
● MFMC estimator: analysis
● Illustration
● Conclusions and future work
Introduction
● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...)
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach
● In this paper, the only information is contained in a sample of one-step transitions of the system
● In this context, we propose a Model-Free Monte Carlo (MFMC) estimator of the performance of a given policy that mimics in some way the Monte Carlo estimator
Problem statement
● We consider a discrete-time system whose dynamics over T stages is given by
  $x_{t+1} = f(x_t, u_t, w_t), \quad t = 0, 1, \ldots, T-1$
● All $x_t$ lie in a normed state space X, all $u_t$ lie in a normed action space U, and the disturbances $w_t$ are i.i.d. according to a probability distribution pW(.)
● An instantaneous reward $r_t = \rho(x_t, u_t, w_t)$ is associated with the action $u_t$ taken while being in state $x_t$
● A policy h: {0, ..., T−1} × X → U is given, and we want to evaluate its performance
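Problem statement
● The expected return of the policy h when starting from an initial state $x_0$ is given by
  $J^h(x_0) = E_{w_0, \ldots, w_{T-1} \sim p_W(\cdot)} [ R^h(x_0) ]$
  where
  $R^h(x_0) = \sum_{t=0}^{T-1} \rho(x_t, h(t, x_t), w_t)$
  with
  $x_{t+1} = f(x_t, h(t, x_t), w_t)$
[Figure: a trajectory $x_0, x_1, \ldots, x_T$ generated by the disturbances $w_0, \ldots, w_{T-1}$, with rewards $r_0, \ldots, r_{T-1}$]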
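Problem statement
● Problem: the functions f, ρ and pW(.) are unknown
● They are replaced by a sample of n one-step system transitions
  $F_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$
  where the pairs $(x^l, u^l)$ are arbitrarily chosen and the pairs $(r^l, y^l)$ are determined by
  $(r^l, y^l) = (\rho(x^l, u^l, w^l), f(x^l, u^l, w^l))$, where $w^l$ is drawn according to pW(.)
● How to evaluate $J^h(x_0)$ in this context?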
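The Monte Carlo estimator
● We define the Monte Carlo estimator of the expected return of h when starting from the initial state $x_0$:
  $M^h_p(x_0) = \frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r^i_t$
  with
  $r^i_t = \rho(x^i_t, h(t, x^i_t), w^i_t)$, $x^i_{t+1} = f(x^i_t, h(t, x^i_t), w^i_t)$, $x^i_0 = x_0$, $w^i_t \sim p_W(\cdot)$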
The Monte Carlo estimator
[Figure: p trajectories simulated from $x_0$ with disturbances $w^i_t \sim p_W(\cdot)$; trajectory i collects the rewards $r^i_0, \ldots, r^i_{T-1}$, and the MC estimator is the average of the p cumulated returns, $\frac{1}{p} \sum_{i=1}^{p} \sum_{t=0}^{T-1} r^i_t$]
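The MC estimator can be sketched in a few lines when a simulator is available. The sketch below is only an illustration: the function name `mc_estimate` and the toy scalar system (`toy_f`, `toy_rho`, `toy_h`, uniform disturbances) are assumptions made for the example, not the system studied in the talk.

```python
import random

def mc_estimate(f, rho, h, sample_w, x0, T, p):
    """Average the cumulated reward of p trajectories simulated under policy h."""
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = h(t, x)            # action prescribed by the policy
            w = sample_w()         # fresh disturbance drawn from p_W
            total += rho(x, u, w)  # instantaneous reward
            x = f(x, u, w)         # next state
    return total / p

# Hypothetical toy system, used only to make the sketch runnable.
def toy_f(x, u, w):
    return 0.5 * x + u + w

def toy_rho(x, u, w):
    return -x * x + w

def toy_h(t, x):
    return -0.5 * x

random.seed(0)
estimate = mc_estimate(toy_f, toy_rho, toy_h,
                       lambda: random.uniform(-0.1, 0.1),
                       x0=0.5, T=15, p=1000)
print(estimate)
```

Each of the p trajectories needs T calls to the simulator, which is exactly what is unavailable in the model-free setting.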
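The Monte Carlo estimator
● We assume that the random variable $R^h(x_0)$ admits a finite variance
  $\sigma^2_{R^h}(x_0) = Var[R^h(x_0)]$
● The bias and variance of the Monte Carlo estimator $M^h_p(x_0)$ are
  $E[M^h_p(x_0)] - J^h(x_0) = 0$ and $Var[M^h_p(x_0)] = \sigma^2_{R^h}(x_0) / p$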
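The bias and variance of the MC estimator follow from a short standard argument, sketched here. The notation is an assumption for this sketch: $R^1, \ldots, R^p$ denote the p independent copies of the return $R^h(x_0)$ produced by the simulated trajectories, $M^h_p(x_0)$ their average, and $\sigma^2_{R^h}(x_0)$ the variance of $R^h(x_0)$.

```latex
\begin{align*}
E\big[M^h_p(x_0)\big]
  &= \frac{1}{p} \sum_{i=1}^{p} E\big[R^i\big]
   = J^h(x_0), \\
Var\big[M^h_p(x_0)\big]
  &= \frac{1}{p^2} \sum_{i=1}^{p} Var\big[R^i\big]
   = \frac{\sigma^2_{R^h}(x_0)}{p}.
\end{align*}
```

The first line uses linearity of the expectation; the second uses the independence of the p trajectories, so that the variances add up.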
The Model-free Monte Carlo estimator
● Here, the MC approach is not feasible, since the system is unknown
● We introduce the Model-Free Monte Carlo estimator
● From the sample of transitions, we build p sequences of different transitions of length T, called "broken trajectories"
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h
● We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h
● The algorithm has complexity O(npT)
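The construction of the broken trajectories can be sketched as a greedy nearest-neighbour search. This is a minimal sketch under stated assumptions, not a reference implementation: the names `mfmc_estimate` and `l1_dist` are hypothetical, states and actions are scalars, the distance is taken as the sum of the two absolute differences, each transition is used at most once, and ties go to the lowest index.

```python
import random

def l1_dist(a, b):
    """Assumed distance between two scalar state-action pairs (x, u)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mfmc_estimate(transitions, h, x0, T, p, dist):
    """Average the returns of p broken trajectories of length T.

    transitions: list of one-step transitions (x, u, r, y).
    """
    if len(transitions) < p * T:
        raise ValueError("need at least p*T transitions")
    available = list(range(len(transitions)))  # indices not used yet
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = h(t, x)
            # Closest unused transition; min() returns the first minimum,
            # i.e. the lowest index, because `available` stays sorted.
            best = min(available,
                       key=lambda l: dist((x, u), transitions[l][:2]))
            available.remove(best)  # a transition is never reused
            _, _, r, y = transitions[best]
            total += r
            x = y                   # continue from the observed next state
    return total / p

# Hypothetical sample: 200 transitions of a toy scalar system.
random.seed(0)
sample = []
for _ in range(200):
    x, u = random.uniform(-1, 1), random.uniform(-1, 1)
    w = random.uniform(-0.1, 0.1)
    sample.append((x, u, -x * x + w, 0.5 * x + u + w))

estimate = mfmc_estimate(sample, lambda t, x: -0.5 * x,
                         x0=0.5, T=15, p=10, dist=l1_dist)
print(estimate)
```

Each of the pT steps scans at most n transitions, which is consistent with the O(npT) complexity stated above.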
The Model-free Monte Carlo estimator
How does it work? Example with T = 3, p = 2, n = 8
MFMC estimator: analysis
● Assumption: the functions f, ρ and h are Lipschitz continuous
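Written out, a Lipschitz assumption of this kind could take the following form. This is a sketch only: the constant names $L_f$, $L_\rho$, $L_h$ and the choice of measuring distances in X×U by the sum of the two norms are assumptions of this sketch, not statements recovered from the slide.

```latex
\begin{align*}
\| f(x, u, w) - f(x', u', w) \|_X
  &\le L_f \big( \| x - x' \|_X + \| u - u' \|_U \big), \\
| \rho(x, u, w) - \rho(x', u', w) |
  &\le L_\rho \big( \| x - x' \|_X + \| u - u' \|_U \big), \\
\| h(t, x) - h(t, x') \|_U
  &\le L_h \, \| x - x' \|_X ,
\end{align*}
% for all x, x' in X, all u, u' in U, all disturbances w
% and all t in {0, ..., T-1}.
```

Intuitively, such conditions ensure that replacing a state-action pair by a nearby one from the sample changes the next state and the reward only by a controlled amount.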
MFMC estimator: analysis
● The only information available on the system is gathered in a sample $F_n$ of n one-step transitions
● We define the random variable $\tilde{F}_n$ as follows:
  the set of pairs $P_n = \{(x^l, u^l)\}_{l=1}^{n}$ is arbitrarily chosen,
  whereas the pairs $(r^l, y^l)$ are determined by $(\rho(x^l, u^l, w^l), f(x^l, u^l, w^l))$, where $w^l$ is drawn according to pW(.)
● $F_n$ is a realization of the random set $\tilde{F}_n$
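MFMC estimator: analysis
● Distance metric ∆
  $\Delta((x, u), (x', u')) = \|x - x'\|_X + \|u - u'\|_U$
● k-sparsity
  $\alpha_k(P_n) = \sup_{(x, u) \in X \times U} \Delta_k^{P_n}(x, u)$
● $\Delta_k^{P_n}(x, u)$ denotes the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample $P_n$ of state-action pairs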
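MFMC estimator: analysis
● The k-sparsity can be seen as the smallest radius such that all ∆-balls in X×U contain at least k elements from the sample of state-action pairs
[Figure: a ∆-ball in X×U centered at (x, u), with a sample element (x', u'); axes X and U]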
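The k-th nearest-neighbour distance, and a rough numerical stand-in for the k-sparsity, can be sketched as follows. The function names, the scalar state and action spaces, the sum-of-absolute-differences distance and the finite grid of probe points are all assumptions of this sketch; taking a maximum over finitely many probes only gives a lower bound on the radius described above.

```python
def l1_dist(a, b):
    """Assumed distance between two scalar state-action pairs (x, u)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def kth_nn_distance(point, pairs, k, dist):
    """Distance from `point` to its k-th nearest neighbour among `pairs`."""
    distances = sorted(dist(point, q) for q in pairs)
    return distances[k - 1]

def approx_k_sparsity(pairs, k, probes, dist):
    """Largest k-th nearest-neighbour distance over a finite set of probes."""
    return max(kth_nn_distance(z, pairs, k, dist) for z in probes)

# Hypothetical sample: an 11 x 11 uniform grid over [0, 1] x [0, 1].
pairs = [(i / 10, j / 10) for i in range(11) for j in range(11)]
# Probe points on a finer 21 x 21 grid over the same square.
probes = [(i / 20, j / 20) for i in range(21) for j in range(21)]

print(approx_k_sparsity(pairs, 1, probes, l1_dist))
print(approx_k_sparsity(pairs, 5, probes, l1_dist))
```

A denser sample of state-action pairs shrinks these distances, which is the sense in which the sample becomes less sparse.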
MFMC estimator: analysis
● Expected value of the MFMC estimator
● Theorem: bound on the bias of the MFMC estimator
MFMC estimator: analysis
● Variance of the MFMC estimator
● Theorem: bound on the variance of the MFMC estimator
Illustration
● System
● pW(.) is uniform over W, T = 15, x0 = 0.5
Illustration
● Simulations for p = 10, n = 100 … 10 000, uniform grid
[Figure: Model-free Monte Carlo estimator (as a function of n) and Monte Carlo estimator]
Illustration
● Simulations for p = 1 … 100, n = 10 000, uniform grid
[Figure: Model-free Monte Carlo estimator and Monte Carlo estimator, both as functions of p]
Conclusions and future work
Conclusions
● We have proposed in this paper an estimator of the expected return of a policy in a model-free setting, the MFMC estimator
● We have provided bounds on the bias and variance of the MFMC estimator
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator as the sample sparsity goes to zero
Future work
● MFMC estimator in a direct policy search framework
● One could extend this approach to evaluate return distributions (and not only expected values). This could allow the development of "safe" policy search techniques based on Value at Risk (VaR) criteria
Thank you