Min Max Generalization for Deterministic Batch Mode Reinforcement Learning

Raphael Fonteneau (1,2)

The work presented in this talk was done in collaboration with: Bernard Boigelot (1), Damien Ernst (1), Quentin Louveaux (1), Susan A. Murphy (3), Louis Wehenkel (1)

(1) Department of EECS, University of Liège, Belgium
(2) Inria Lille – Nord Europe, France
(3) Department of Statistics, University of Michigan, Ann Arbor, USA

SequeL – Inria Lille – Nord Europe, France – 15 June 2012

Introduction

Goal

How to safely control a deterministic system living in a continuous state space, given the knowledge of:
- A batch collection of trajectories of the system,
- The maximal variations of the system (upper bounds on the Lipschitz constants).

Author: ArcCan – Wikipedia Commons

Menu

Introduction

I. Direct approach - CGRL algorithm

II. Reformulation of the original problem - 2 relaxation schemes in the two-stage case

III. Comparison of the 3 proposed solutions

Conclusions and future work

Formalization

- Deterministic dynamics:
- Deterministic reward function:
- Fixed initial state:
- Continuous state space, finite action space:
- Return of a sequence of actions:
- Optimal return:
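In notation assumed here rather than taken from the slide (the symbols $f$, $\rho$, $x_0$, $\mathcal{X}$, $\mathcal{U}$, $T$ and $J$ are illustrative, and the return is assumed to be an undiscounted sum over a finite horizon $T$), the items above can be sketched as:

```latex
\begin{align*}
x_{t+1} &= f(x_t, u_t), \qquad t = 0, \dots, T-1 \\
r_t &= \rho(x_t, u_t) \in \mathbb{R} \\
x_0 &\in \mathcal{X} \ \text{fixed}, \qquad \mathcal{X} \subset \mathbb{R}^d, \qquad \mathcal{U} \ \text{finite} \\
J_T^{u_0, \dots, u_{T-1}} &= \sum_{t=0}^{T-1} \rho(x_t, u_t) \\
J_T^{*} &= \max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} J_T^{u_0, \dots, u_{T-1}}
\end{align*}
```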

The "batch" setting

- The dynamics and the reward function are unknown.
- For each action, a set of one-step transitions is given:
- We note:

Lipschitz continuity

- The system dynamics and the reward function are Lipschitz continuous, where the norm is the Euclidean norm over the state space.
- We assume that two constants satisfying these Lipschitz inequalities are known.
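Written out, Lipschitz continuity of the dynamics and of the reward function takes the following form. The names $f$, $\rho$, $L_f$ and $L_\rho$ are assumptions chosen for illustration, and the inequalities are taken to hold for all states $x, x'$ and every action $u$:

```latex
\begin{align*}
\| f(x, u) - f(x', u) \| &\le L_f \, \| x - x' \| \\
| \rho(x, u) - \rho(x', u) | &\le L_\rho \, \| x - x' \|
\end{align*}
```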

Compatible environments

- Compatible dynamics and reward functions:
- Return of a sequence of actions under a compatible environment:
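As a minimal sketch, assume that "compatible" means Lipschitz continuous with the known constants and in agreement with every transition of the batch. Under that assumption, a necessary condition for any compatible environment to exist is that all pairs of transitions sharing an action respect both Lipschitz inequalities. The function name, the `(x, u, r, y)` tuple layout and the parameter names `L_f` and `L_rho` are hypothetical.

```python
import math
from itertools import combinations

def is_batch_consistent(transitions, L_f, L_rho, tol=1e-12):
    """Check a necessary condition for a compatible environment to exist.

    transitions: list of (x, u, r, y), with x and y tuples of floats,
    u a hashable action label and r a float reward.
    """
    for (x1, u1, r1, y1), (x2, u2, r2, y2) in combinations(transitions, 2):
        if u1 != u2:
            continue  # the inequalities only couple transitions sharing an action
        d = math.dist(x1, x2)
        if abs(r1 - r2) > L_rho * d + tol:
            return False  # reward varies faster than L_rho allows
        if math.dist(y1, y2) > L_f * d + tol:
            return False  # next state varies faster than L_f allows
    return True

batch = [
    ((0.0,), 0, 1.0, (0.5,)),
    ((1.0,), 0, 1.5, (1.5,)),
    ((0.0,), 1, 5.0, (3.0,)),
]
ok = is_batch_consistent(batch, L_f=1.0, L_rho=1.0)
too_tight = is_batch_consistent(batch, L_f=1.0, L_rho=0.1)
```

With `L_rho=0.1` the first two transitions contradict each other, so no environment could be compatible with both the batch and that constant.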

Min max approach to generalization

- Worst possible return of a given sequence of actions:
- Our objective:
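One way to write the two items above, in notation assumed here for illustration (the symbol $W$ for the worst possible return, the primes on compatible functions, and the finite horizon $T$ are not taken from the slide):

```latex
\begin{align*}
W^{u_0, \dots, u_{T-1}} &= \min_{(f', \rho') \ \text{compatible}} \ J_{T, (f', \rho')}^{u_0, \dots, u_{T-1}} \\
(u_0^{*}, \dots, u_{T-1}^{*}) &\in \arg\max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} W^{u_0, \dots, u_{T-1}}
\end{align*}
```

Here $J_{T, (f', \rho')}$ stands for the return obtained when the dynamics and reward function are replaced by a compatible pair $(f', \rho')$.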

Direct approach

Bounds from a sequence of transitions

For a given sequence of actions, one can compute, from any sequence of one-step transitions that is compatible with it, a lower bound on the "worst possible return" of that sequence of actions:
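A minimal sketch of one bound of this kind, obtained by applying the two Lipschitz inequalities stage by stage: each transition contributes its observed reward minus a penalty proportional to the gap between where the previous transition ended and where this one starts. The formula is an assumption of this sketch (the slide's own expression is not reproduced here), and the function names and the `(x, u, r, y)` layout are hypothetical.

```python
import math

def lipschitz_coeff(L_f, L_rho, n):
    """L_rho * (1 + L_f + ... + L_f**(n-1)): how much an n-step return can
    change per unit of displacement of its starting state."""
    return L_rho * sum(L_f ** i for i in range(n))

def lower_bound(x0, transitions, L_f, L_rho):
    """Lower bound on the return of the actions carried by `transitions`,
    a list of T tuples (x, u, r, y) whose actions are u_0, ..., u_{T-1}."""
    T = len(transitions)
    bound = 0.0
    prev_end = x0  # the first gap is measured from the initial state
    for t, (x, _u, r, y) in enumerate(transitions):
        gap = math.dist(prev_end, x)
        bound += r - lipschitz_coeff(L_f, L_rho, T - t) * gap
        prev_end = y
    return bound

# Transitions that chain perfectly from x0 give the plain sum of rewards.
chained = lower_bound((0.0,), [((0.0,), 0, 1.0, (1.0,)),
                               ((1.0,), 1, 2.0, (2.0,))], 1.0, 1.0)
# Gaps between transitions make the bound more pessimistic.
gapped = lower_bound((0.0,), [((0.1,), 0, 1.0, (1.0,)),
                              ((1.2,), 1, 2.0, (2.0,))], 1.0, 1.0)
```

Earlier gaps are penalized more heavily because an error in an early state propagates through all remaining stages.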

Maximal lower bound

- One can define a best lower bound over all possible sequences of transitions:
- Such a bound is "tight" w.r.t. the dispersion of the batch collection of data:
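To make "dispersion" concrete, the sketch below assumes it means the largest distance from any state to its closest sampled state, and approximates it on a finite set of probe states. Both the definition and the function name are assumptions made for illustration.

```python
import math

def dispersion_estimate(sample_states, probe_states):
    """Largest distance from a probe state to its nearest sampled state."""
    return max(min(math.dist(p, s) for s in sample_states)
               for p in probe_states)

samples = [(0.0,), (0.5,), (1.0,)]
probes = [(i / 100,) for i in range(101)]  # fine grid over [0, 1]
alpha = dispersion_estimate(samples, probes)
```

Adding sampled states in the largest uncovered regions lowers this quantity, which is the sense in which a denser batch tightens the bound.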

The CGRL algorithm

- One can define a best lower bound over all possible sequences of transitions:
- Finding such a sequence can be reformulated as a shortest path problem, and the sequence can be found without enumerating all sequences. This is what the CGRL algorithm does.

Properties of the CGRL algorithm

- The CGRL solution converges towards an optimal sequence when the dispersion of the sample of transitions converges towards zero.
- If an "optimal trajectory" is available in the batch sample, then CGRL will return such an optimal sequence.
- Computational complexity: quadratic w.r.t. the size of the batch collection of transitions.
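The sketch below illustrates how the best sequence can be found without enumeration, using a stage-by-stage dynamic program over transitions (a path search in a layered graph). It is not the authors' implementation: the bound it maximizes (observed rewards minus Lipschitz-weighted gaps between consecutive transitions) and all names are assumptions of this example. Each stage compares every pair of transitions, hence the quadratic cost in the batch size.

```python
import math

def best_lower_bound(x0, batch, T, L_f, L_rho):
    """Maximize a chained lower bound over all length-T sequences of
    transitions from `batch`, a list of (x, u, r, y) tuples.

    Returns (bound, actions). Cost: O(T * n**2) for n transitions.
    """
    def coeff(n):  # L_rho * (1 + L_f + ... + L_f**(n-1))
        return L_rho * sum(L_f ** i for i in range(n))

    n = len(batch)
    # value[j]: best partial bound of a sequence ending with transition j
    value = [r - coeff(T) * math.dist(x0, x) for (x, _u, r, _y) in batch]
    back = []  # back[t-1][j]: best predecessor of transition j at stage t
    for t in range(1, T):
        c = coeff(T - t)
        new_value, pointers = [], []
        for (xj, _uj, rj, _yj) in batch:
            best_i = max(range(n),
                         key=lambda i: value[i] - c * math.dist(batch[i][3], xj))
            gap = math.dist(batch[best_i][3], xj)
            new_value.append(rj + value[best_i] - c * gap)
            pointers.append(best_i)
        value = new_value
        back.append(pointers)
    # Backtrack from the best final transition to recover the actions.
    j = max(range(n), key=lambda i: value[i])
    bound = value[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return bound, [batch[k][1] for k in path]

batch = [
    ((0.0,), "a", 0.0, (1.0,)),
    ((0.0,), "b", 0.5, (-1.0,)),
    ((1.0,), "a", 1.0, (2.0,)),
    ((-1.0,), "b", 0.0, (-2.0,)),
]
bound, actions = best_lower_bound((0.0,), batch, T=2, L_f=1.0, L_rho=1.0)
```

In this toy batch, the transitions for action "a" chain exactly from the initial state, so the bound equals the observed sum of rewards along that chain.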

The CGRL algorithm - illustration: the puddle world

[Figure: CGRL vs. FQI (Fitted Q Iteration) when the state space is uniformly covered by the sample]

[Figure: CGRL vs. FQI (Fitted Q Iteration) when information about the puddle area is removed]

Reformulation

Reformulation of min max problem

The two-stage case

Decomposition

[Diagram: decomposition of the two-stage problem; the labels that survive are "Equivalent", "Solved (closed-form)" and "NP-hard"]

Building relaxation schemes

- We focus on the following (NP-hard) problem:
- We look for relaxation schemes that preserve the nature of the min max generalization problem, i.e. that offer performance guarantees.
- We thus build relaxation schemes providing lower bounds on the return of the sequence of actions.

Trust-region relaxation scheme

- We keep one constraint of each type:

Lagrangian relaxation

Comparison

Direct approach vs Trust-region relaxation

Trust-region vs Lagrangian relaxation

Synthesis

Illustration

- Dynamics:
- Reward function:
- Initial state:
- Action space:
- Grid generation:
- 100 batch collections of transitions drawn uniformly at random

Illustration: maximal bounds

[Figure: maximal bounds, with panels "Grid" and "Uniform sampling"]

Illustration: returns

[Figure: returns, with panels "Grid" and "Uniform sampling"]

Conclusions and future work

(non) Conclusion

Problem still unsolved

Future work

- T-stage reformulation
- Stochastic case
- Exact solution for small dimensions
- Infinite horizon

Associated publications

- "Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes". R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. arXiv:1202.5298v1, 2012.
- "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Series: Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.
- "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
- "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March-2 April, 2009.
