¹ Department of EECS, University of Liège, Belgium; ² Inria Lille – Nord Europe, France; ³ Department of Statistics, University of Michigan, Ann Arbor, USA

SequeL – Inria Lille – Nord Europe, France – June 15, 2012

Introduction

Goal

How to safely control a deterministic system living in a continuous state space, given:
- a batch collection of trajectories of the system,
- the maximal variations of the system (upper bounds on the Lipschitz constants).


Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Introduction

Formalization

● Deterministic dynamics: $x_{t+1} = f(x_t, u_t)$
● Deterministic reward function: $r_t = \rho(x_t, u_t)$
● Fixed initial state: $x_0$
● Continuous state space, finite action space: $x_t \in X \subset \mathbb{R}^d$, $u_t \in U$ with $U$ finite
● Return of a sequence of actions over $T$ stages: $J(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \rho(x_t, u_t)$
● Optimal return: $J^* = \max_{(u_0, \dots, u_{T-1}) \in U^T} J(u_0, \dots, u_{T-1})$

Introduction

The "batch" setting ●

Dynamics and reward function are unknown

●

For all actions

●

We note:

a set of one-step transitions is given:
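As a purely illustrative sketch of the batch format (not from the slides), one could build such a set of transitions as follows; f_true, rho_true, and all numerical values are hypothetical stand-ins for the unknown system:

```python
import numpy as np

# Hypothetical simulator standing in for the unknown system (f, rho);
# in the batch setting, only its one-step transitions are ever observed.
def f_true(x, u):
    return x + 0.1 * u            # placeholder dynamics

def rho_true(x, u):
    return -float(np.dot(x, x))   # placeholder reward

rng = np.random.default_rng(0)
actions = [np.array([-1.0]), np.array([1.0])]   # finite action space U

# Batch of one-step transitions F = {(x^l, u^l, r^l, y^l)}
F = []
for _ in range(20):
    x = rng.uniform(-1.0, 1.0, size=1)
    u = actions[rng.integers(len(actions))]
    F.append((x, u, rho_true(x, u), f_true(x, u)))
```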

Introduction

Lipschitz continuity

● The system dynamics and the reward function are Lipschitz continuous:
$\|f(x, u) - f(x', u)\| \le L_f \|x - x'\|$ and $|\rho(x, u) - \rho(x', u)| \le L_\rho \|x - x'\|$ for all $x, x' \in X$ and $u \in U$,
where $\|\cdot\|$ is the Euclidean norm over the state space.
● We assume that two constants $L_f$ and $L_\rho$ satisfying the above inequalities are known.

Introduction

Compatible environments

● Compatible dynamics and reward functions: pairs $(\hat{f}, \hat{\rho})$ that satisfy the Lipschitz inequalities with constants $L_f$ and $L_\rho$ and that agree with the batch, i.e. $\hat{f}(x^l, u^l) = y^l$ and $\hat{\rho}(x^l, u^l) = r^l$ for all $l$
● Return of a sequence of actions under a compatible environment: $J^{(\hat{f}, \hat{\rho})}(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \hat{\rho}(\hat{x}_t, u_t)$, with $\hat{x}_0 = x_0$ and $\hat{x}_{t+1} = \hat{f}(\hat{x}_t, u_t)$

Introduction

Min max approach to generalization

● Worst possible return of a given sequence of actions: $B(u_0, \dots, u_{T-1}) = \min_{(\hat{f}, \hat{\rho})\ \text{compatible}} J^{(\hat{f}, \hat{\rho})}(u_0, \dots, u_{T-1})$
● Our objective: find $(u_0^*, \dots, u_{T-1}^*) \in \arg\max_{(u_0, \dots, u_{T-1}) \in U^T} B(u_0, \dots, u_{T-1})$

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Direct approach

Bounds from a sequence of transitions

● For a given sequence of actions, one can compute, from any sequence of one-step transitions compatible with the sequence of actions, a lower bound on the "worst possible return" of the sequence of actions:
$B(\tau) = \sum_{t=0}^{T-1} \left[ r^{l_t} - L_{Q_{T-t}} \|y^{l_{t-1}} - x^{l_t}\| \right]$, with $y^{l_{-1}} := x_0$ and $L_{Q_N} = L_\rho \sum_{i=0}^{N-1} (L_f)^i$
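A minimal Python sketch of this computation, assuming the additive form of the bound given above (function and variable names are ours, not the authors'):

```python
import numpy as np

def lower_bound(tau, x0, L_f, L_rho):
    """Lower bound on the worst possible return of an action sequence,
    computed from a compatible sequence of one-step transitions
    tau = [(x, u, r, y), ...] of length T (states as NumPy vectors)."""
    T = len(tau)

    def L_Q(N):
        # Lipschitz constant of the N-stage value function:
        # L_{Q_N} = L_rho * (1 + L_f + ... + L_f**(N-1))
        return L_rho * sum(L_f ** i for i in range(N))

    b = 0.0
    prev_y = np.asarray(x0, dtype=float)   # y^{l_{-1}} := x_0
    for t, (x, u, r, y) in enumerate(tau):
        b += r - L_Q(T - t) * np.linalg.norm(prev_y - np.asarray(x, dtype=float))
        prev_y = np.asarray(y, dtype=float)
    return b
```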

Direct approach

Maximal lower bound

● One can define a best lower bound over all possible sequences of transitions: $B^*(u_0, \dots, u_{T-1}) = \max_{\tau} B(\tau)$, the maximum being taken over the sequences of transitions from $\mathcal{F}$ that are compatible with the sequence of actions
● Such a bound is "tight" w.r.t. the dispersion of the batch collection of data: the gap between the worst possible return and the bound vanishes as the dispersion $\alpha(\mathcal{F}) = \sup_{x \in X} \min_{l} \|x - x^l\|$ goes to zero

Direct approach

The CGRL algorithm

● One can then search for a sequence of actions maximizing the best lower bound: $(\hat{u}_0, \dots, \hat{u}_{T-1}) \in \arg\max_{(u_0, \dots, u_{T-1}) \in U^T} B^*(u_0, \dots, u_{T-1})$
● Finding such a sequence can be reformulated as a shortest path problem, so the sequence can be found without enumerating all sequences. This is what the CGRL algorithm does.

Properties of the CGRL algorithm

● The CGRL solution converges towards an optimal sequence when the dispersion of the sample of transitions converges towards zero.
● If an "optimal trajectory" is available in the batch sample, then CGRL returns such an optimal sequence.
● Computational complexity: quadratic w.r.t. the size of the batch collection of transitions.
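To make the shortest-path reformulation concrete, here is a hedged Python sketch (ours, not the authors' code) that maximizes the additive lower bound by backward dynamic programming over the graph whose nodes are the batch transitions; its O(T·n²) cost matches the quadratic complexity stated above. It assumes the bound form $\sum_t [r^{l_t} - L_{Q_{T-t}} \|y^{l_{t-1}} - x^{l_t}\|]$ with $y^{l_{-1}} := x_0$:

```python
import numpy as np

def cgrl(F, x0, L_f, L_rho, T):
    """Maximize the additive lower bound over all length-T sequences of
    transitions from the batch F = [(x, u, r, y), ...] (a longest-path
    problem on the graph of transitions), and return the induced actions."""
    n = len(F)
    X = np.array([tr[0] for tr in F], dtype=float)  # start states x^l
    R = np.array([tr[2] for tr in F], dtype=float)  # rewards r^l
    Y = np.array([tr[3] for tr in F], dtype=float)  # end states y^l

    def L_Q(N):
        return L_rho * sum(L_f ** i for i in range(N))

    value = R.copy()          # bound-to-go when a transition is used last
    succ = []                 # best successor per stage
    for t in range(T - 2, -1, -1):
        # penalty[l, m]: cost of chaining transition m after transition l
        penalty = L_Q(T - t - 1) * np.linalg.norm(
            Y[:, None, :] - X[None, :, :], axis=2)
        scores = value[None, :] - penalty
        best = scores.argmax(axis=1)
        succ.append(best)
        value = R + scores[np.arange(n), best]
    succ.reverse()

    # the stage-0 transition also pays the distance to the initial state
    dist0 = np.linalg.norm(X - np.asarray(x0, dtype=float)[None, :], axis=1)
    l = int(np.argmax(value - L_Q(T) * dist0))
    actions = []
    for t in range(T):
        actions.append(F[l][1])
        if t < T - 1:
            l = int(succ[t][l])
    return actions
```

Note that maximizing jointly over sequences of transitions is equivalent to maximizing over sequences of actions, since each sequence of transitions is compatible with its own action labels; the algorithm can thus return the actions of the maximizing sequence of transitions.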

Direct approach

The CGRL algorithm – illustration: the puddle world

[Figure: trajectories of CGRL vs. FQI (Fitted Q Iteration) when the state space is uniformly covered by the sample.]

[Figure: trajectories of CGRL vs. FQI when the information about the puddle area is removed from the sample.]

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Reformulation

Reformulation of the min max problem

Reformulation

The two-stage case

Reformulation

Decomposition

[Diagram: in the two-stage case, the min max problem is decomposed into subproblems. The diagram labels read "Equivalent", "Solved (closed-form)", and "NP-hard": one subproblem admits a closed-form solution, while the other is NP-hard.]

Reformulation

Building relaxation schemes

● We focus on the following (NP-hard) problem: the subproblem identified as NP-hard in the decomposition above
● We look for relaxation schemes that preserve the nature of the min max generalization problem, i.e. that still offer performance guarantees
● We thus build relaxation schemes providing lower bounds on the return of the sequence of actions

Reformulation

Trust-region relaxation scheme

● We keep one constraint of each type: the relaxed problem is then a trust-region subproblem (the minimization of a quadratic function subject to a single norm constraint)
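For reference, a generic trust-region subproblem has the form below (notation ours, not from the slides); its global solution can be computed efficiently even when the quadratic is nonconvex, which is what makes this relaxation tractable:

```latex
\min_{z \in \mathbb{R}^d} \; z^\top A z + b^\top z
\quad \text{subject to} \quad \|z - z_0\|_2 \le \delta
```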

Reformulation

Lagrangian relaxation

● The constraints are dualized: by weak duality, the optimal value of the Lagrangian dual is a lower bound on the optimal value of the NP-hard problem
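As a reminder of why dualizing yields a valid bound, weak duality states (standard result, notation ours): for $\min_z f(z)$ subject to $g_i(z) \le 0$, $i = 1, \dots, m$, with Lagrangian $\mathcal{L}(z, \lambda) = f(z) + \sum_{i=1}^{m} \lambda_i g_i(z)$,

```latex
\forall \lambda \ge 0: \quad
\min_{z} \mathcal{L}(z, \lambda) \;\le\; \min_{z :\, g_i(z) \le 0 \ \forall i} f(z)
```

Maximizing the left-hand side over $\lambda \ge 0$ gives the tightest such lower bound.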

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Comparison

Direct approach vs. trust-region relaxation

Comparison

Trust-region vs. Lagrangian relaxation

Comparison

Synthesis

Comparison

Illustration

● Dynamics:
● Reward function:
● Initial state:
● Action space:
● Grid generation:
● 100 batch collections of transitions drawn uniformly at random

Comparison

Illustration – maximal bounds

[Figure: maximal lower bounds; panels: grid generation and uniform sampling.]

Comparison

Illustration – returns

[Figure: returns of the computed sequences of actions; panels: grid generation and uniform sampling.]

Menu

Introduction
I. Direct approach – the CGRL algorithm
II. Reformulation of the original problem – two relaxation schemes in the two-stage case
III. Comparison of the three proposed solutions
Conclusions and future work

Conclusions and future work

(non) Conclusion

The problem is still unsolved.

Conclusions and future work

Future work

● T-stage reformulation
● Stochastic case
● Exact solution for small dimensions
● Infinite horizon?

Associated publications

● "Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes". R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. arXiv:1202.5298v1, 2012.
● "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Series: Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.
● "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
● "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, March 30 - April 2, 2009.