¹ Department of EECS, University of Liège, Belgium; ² Inria Lille – Nord Europe, France; ³ Department of Statistics, University of Michigan, Ann Arbor, USA

SequeL – Inria Lille – Nord Europe, France – June 15, 2012

Introduction

Goal

How to safely control a deterministic system living in a continuous state space, given:
- a batch collection of trajectories of the system,
- the maximal variations of the system (upper bounds on the Lipschitz constants).


Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Introduction

Formalization

● Deterministic dynamics: $x_{t+1} = f(x_t, u_t)$
● Deterministic reward function: $r_t = \rho(x_t, u_t)$
● Fixed initial state: $x_0$
● Continuous state space, finite action space: $x_t \in X \subset \mathbb{R}^d$, $u_t \in U$ with $U$ finite
● Return of a sequence of actions over $T$ stages: $J(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \rho(x_t, u_t)$
● Optimal return: $J^* = \max_{(u_0, \dots, u_{T-1}) \in U^T} J(u_0, \dots, u_{T-1})$

Introduction

The "batch" setting ●

Dynamics and reward function are unknown

●

For all actions

●

We note:

a set of one-step transitions is given:
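As a purely illustrative sketch of the batch format (not from the slides), one could build such a set of transitions as follows; f_true, rho_true, and all numerical values are hypothetical stand-ins for the unknown system:

```python
import numpy as np

# Hypothetical simulator standing in for the unknown system (f, rho);
# in the batch setting, only its one-step transitions are ever observed.
def f_true(x, u):
    return x + 0.1 * u            # placeholder dynamics

def rho_true(x, u):
    return -float(np.dot(x, x))   # placeholder reward

rng = np.random.default_rng(0)
actions = [np.array([-1.0]), np.array([1.0])]   # finite action space U

# Batch of one-step transitions F = {(x^l, u^l, r^l, y^l)}
F = []
for _ in range(20):
    x = rng.uniform(-1.0, 1.0, size=1)
    u = actions[rng.integers(len(actions))]
    F.append((x, u, rho_true(x, u), f_true(x, u)))
```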

Introduction

Lipschitz continuity

● The system dynamics and the reward function are Lipschitz continuous:
$\|f(x, u) - f(x', u)\| \le L_f \|x - x'\|$ and $|\rho(x, u) - \rho(x', u)| \le L_\rho \|x - x'\|$ for all $x, x' \in X$ and $u \in U$,
where $\|\cdot\|$ is the Euclidean norm over the state space.
● We assume that two constants $L_f$ and $L_\rho$ satisfying the above inequalities are known.

Introduction

Compatible environments

● Compatible dynamics and reward functions: pairs $(\hat{f}, \hat{\rho})$ that satisfy the Lipschitz inequalities with constants $L_f$ and $L_\rho$ and that agree with the batch, i.e. $\hat{f}(x^l, u^l) = y^l$ and $\hat{\rho}(x^l, u^l) = r^l$ for all $l$
● Return of a sequence of actions under a compatible environment: $J^{(\hat{f}, \hat{\rho})}(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \hat{\rho}(\hat{x}_t, u_t)$, with $\hat{x}_0 = x_0$ and $\hat{x}_{t+1} = \hat{f}(\hat{x}_t, u_t)$

Introduction

Min max approach to generalization

● Worst possible return of a given sequence of actions: $B(u_0, \dots, u_{T-1}) = \min_{(\hat{f}, \hat{\rho})\ \text{compatible}} J^{(\hat{f}, \hat{\rho})}(u_0, \dots, u_{T-1})$
● Our objective: find $(u_0^*, \dots, u_{T-1}^*) \in \arg\max_{(u_0, \dots, u_{T-1}) \in U^T} B(u_0, \dots, u_{T-1})$

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Direct approach

Bounds from a sequence of transitions

● For a given sequence of actions, one can compute, from any sequence of one-step transitions compatible with the sequence of actions, a lower bound on the "worst possible return" of the sequence of actions:
$B(\tau) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \|y^{l_{t-1}} - x^{l_t}\| \right]$, with $y^{l_{-1}} := x_0$ and $L_{Q_N} = L_\rho \sum_{i=0}^{N-1} (L_f)^i$
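A minimal Python sketch of this computation, assuming the additive form of the bound given above (function and variable names are ours, not the authors'):

```python
import numpy as np

def lower_bound(tau, x0, L_f, L_rho):
    """Lower bound on the worst possible return of an action sequence,
    computed from a compatible sequence of one-step transitions
    tau = [(x, u, r, y), ...] of length T (states as NumPy vectors)."""
    T = len(tau)

    def L_Q(N):
        # Lipschitz constant of the N-stage value function:
        # L_{Q_N} = L_rho * (1 + L_f + ... + L_f**(N-1))
        return L_rho * sum(L_f ** i for i in range(N))

    b = 0.0
    prev_y = np.asarray(x0, dtype=float)   # y^{l_{-1}} := x_0
    for t, (x, u, r, y) in enumerate(tau):
        b += r - L_Q(T - t) * np.linalg.norm(prev_y - np.asarray(x, dtype=float))
        prev_y = np.asarray(y, dtype=float)
    return b
```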

Direct approach

Maximal lower bound

● One can define a best lower bound over all possible sequences of transitions: $B^*(u_0, \dots, u_{T-1}) = \max_{\tau} B(\tau)$, the maximum being taken over the sequences of transitions from $\mathcal{F}$ that are compatible with the sequence of actions
● Such a bound is "tight" w.r.t. the dispersion of the batch collection of data: the gap between the worst possible return and the bound vanishes as the dispersion $\alpha(\mathcal{F}) = \sup_{x \in X} \min_{l} \|x - x^l\|$ goes to zero

Direct approach

The CGRL algorithm

● One can then search for a sequence of actions maximizing the best lower bound: $(\hat{u}_0, \dots, \hat{u}_{T-1}) \in \arg\max_{(u_0, \dots, u_{T-1}) \in U^T} B^*(u_0, \dots, u_{T-1})$
● Finding such a sequence can be reformulated as a shortest path problem, so the sequence can be found without enumerating all sequences. This is what the CGRL algorithm does.

Properties of the CGRL algorithm

● The CGRL solution converges towards an optimal sequence when the dispersion of the sample of transitions converges towards zero.
● If an "optimal trajectory" is available in the batch sample, then CGRL returns such an optimal sequence.
● Computational complexity: quadratic w.r.t. the size of the batch collection of transitions.
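To make the shortest-path reformulation concrete, here is a hedged Python sketch (ours, not the authors' code) that maximizes the additive lower bound by backward dynamic programming over the graph whose nodes are the batch transitions; its O(T·n²) cost matches the quadratic complexity stated above. It assumes the bound form $\sum_t [r^{l_t} - L_{Q_{T-t}} \|y^{l_{t-1}} - x^{l_t}\|]$ with $y^{l_{-1}} := x_0$:

```python
import numpy as np

def cgrl(F, x0, L_f, L_rho, T):
    """Maximize the additive lower bound over all length-T sequences of
    transitions from the batch F = [(x, u, r, y), ...] (a longest-path
    problem on the graph of transitions), and return the induced actions."""
    n = len(F)
    X = np.array([tr[0] for tr in F], dtype=float)  # start states x^l
    R = np.array([tr[2] for tr in F], dtype=float)  # rewards r^l
    Y = np.array([tr[3] for tr in F], dtype=float)  # end states y^l

    def L_Q(N):
        return L_rho * sum(L_f ** i for i in range(N))

    value = R.copy()          # bound-to-go when a transition is used last
    succ = []                 # best successor per stage
    for t in range(T - 2, -1, -1):
        # penalty[l, m]: cost of chaining transition m after transition l
        penalty = L_Q(T - t - 1) * np.linalg.norm(
            Y[:, None, :] - X[None, :, :], axis=2)
        scores = value[None, :] - penalty
        best = scores.argmax(axis=1)
        succ.append(best)
        value = R + scores[np.arange(n), best]
    succ.reverse()

    # the stage-0 transition also pays the distance to the initial state
    dist0 = np.linalg.norm(X - np.asarray(x0, dtype=float)[None, :], axis=1)
    l = int(np.argmax(value - L_Q(T) * dist0))
    actions = []
    for t in range(T):
        actions.append(F[l][1])
        if t < T - 1:
            l = int(succ[t][l])
    return actions
```

Note that maximizing jointly over sequences of transitions is equivalent to maximizing over sequences of actions, since each sequence of transitions is compatible with its own action labels; the algorithm can thus return the actions of the maximizing sequence of transitions.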

Direct approach

The CGRL algorithm – illustration: the puddle world

[Figure: trajectories of CGRL vs. FQI (Fitted Q Iteration) when the state space is uniformly covered by the sample.]

[Figure: trajectories of CGRL vs. FQI when the information about the puddle area is removed from the sample.]

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Reformulation

Reformulation of the min max problem

Reformulation

The two-stage case

Reformulation

Decomposition

[Diagram: in the two-stage case, the min max problem is decomposed into subproblems. The diagram labels read "Equivalent", "Solved (closed-form)", and "NP-hard": one subproblem admits a closed-form solution, while the other is NP-hard.]

Reformulation

Building relaxation schemes

● We focus on the following (NP-hard) problem: the subproblem identified as NP-hard in the decomposition above
● We look for relaxation schemes that preserve the nature of the min max generalization problem, i.e. that still offer performance guarantees
● We thus build relaxation schemes providing lower bounds on the return of the sequence of actions

Reformulation

Trust-region relaxation scheme

● We keep one constraint of each type: the relaxed problem is then a trust-region subproblem (the minimization of a quadratic function subject to a single norm constraint)
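For reference, a generic trust-region subproblem has the form below (notation ours, not from the slides); its global solution can be computed efficiently even when the quadratic is nonconvex, which is what makes this relaxation tractable:

```latex
\min_{z \in \mathbb{R}^d} \; z^\top A z + b^\top z
\quad \text{subject to} \quad \|z - z_0\|_2 \le \delta
```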

Reformulation

Lagrangian relaxation

● The constraints are dualized: by weak duality, the optimal value of the Lagrangian dual is a lower bound on the optimal value of the NP-hard problem
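As a reminder of why dualizing yields a valid bound, weak duality states (standard result, notation ours): for $\min_z f(z)$ subject to $g_i(z) \le 0$, $i = 1, \dots, m$, with Lagrangian $\mathcal{L}(z, \lambda) = f(z) + \sum_{i=1}^{m} \lambda_i g_i(z)$,

```latex
\forall \lambda \ge 0: \quad
\min_{z} \mathcal{L}(z, \lambda) \;\le\; \min_{z :\, g_i(z) \le 0 \ \forall i} f(z)
```

Maximizing the left-hand side over $\lambda \ge 0$ gives the tightest such lower bound.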

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Comparison

Direct approach vs. trust-region relaxation

Comparison

Trust-region vs. Lagrangian relaxation

Comparison

Synthesis

Comparison

Illustration

● Dynamics:
● Reward function:
● Initial state:
● Action space:
● Grid generation:
● 100 batch collections of transitions drawn uniformly at random

Comparison

Illustration – maximal bounds

[Figure: maximal lower bounds; panels: grid generation and uniform sampling.]

Comparison

Illustration – returns

[Figure: returns of the computed sequences of actions; panels: grid generation and uniform sampling.]

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Conclusions and future work

(non) Conclusion

The problem is still unsolved.

Conclusions and future work

Future work

● T-stage reformulation
● Stochastic case
● Exact solution for small dimensions
● Infinite horizon?

Associated publications

● "Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes". R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. arXiv:1202.5298v1, 2012.
● "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Series: Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.
● "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
● "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, March 30 - April 2, 2009.