Min Max Generalization for Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes

Raphael Fonteneau, with Damien Ernst, Bernard Boigelot and Quentin Louveaux
Electrical Engineering and Computer Science Department, University of Liège, Belgium
November 29th, 2013, Maastricht, The Netherlands

Goal

How to control a system so as to avoid the worst, given the knowledge of:
- A batch of (random) trajectories
- Maximal variations of the system, in the form of upper bounds on Lipschitz constants

NASA image - public domain – Wikipedia

A motivation: dynamic treatment regimes

[Figure: p patients (rows 1 to p) followed over time steps 0, 1, ..., T. A batch collection of patient trajectories is available. What is the 'optimal' treatment?]

Formalization

Deterministic dynamics:



Deterministic reward function:



Fixed initial state:



Continuous state space, finite action space:



Return of a sequence of actions:



Optimal return:
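
As a minimal sketch of the two quantities named above, the snippet below rolls out a deterministic system for a fixed sequence of actions and finds the optimal return by enumerating all sequences. The dynamics `f`, the reward `rho`, the horizon `T` and the action set are toy assumptions chosen for illustration; they are not the system studied in the talk.

```python
from itertools import product


def sequence_return(f, rho, x0, actions):
    """Sum of rewards collected when applying `actions` in order from state x0."""
    x, total = x0, 0.0
    for u in actions:
        total += rho(x, u)   # reward observed in the current state
        x = f(x, u)          # deterministic transition to the next state
    return total


def optimal_return(f, rho, x0, action_space, T):
    """Brute-force maximum of the return over all |U|^T sequences of actions."""
    best = max(product(action_space, repeat=T),
               key=lambda seq: sequence_return(f, rho, x0, seq))
    return best, sequence_return(f, rho, x0, best)


# Hypothetical toy system: scalar state, two actions, reward = -|x|.
def f(x, u):
    return x + u


def rho(x, u):
    return -abs(x)


best_seq, best_ret = optimal_return(f, rho, x0=1.0, action_space=(-1.0, 1.0), T=2)
```

Enumeration is only practical for a small action set and a short horizon, and it requires `f` and `rho` to be known, which is precisely what the batch mode setting does not assume.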

The "batch" mode setting: learning from trajectories

System dynamics and reward function are unknown

For every action, a set of transitions is known:

Each set of transitions is non-empty:

Define:
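
One possible in-memory representation of such a batch is sketched below. The tuple layout `(x, u, r, y)` (state, action, reward, next state) and the helper name `group_by_action` are assumptions made for illustration, not notation taken from the slides.

```python
from collections import defaultdict


def group_by_action(transitions, action_space):
    """Split one-step transitions (x, u, r, y) into one set per action.

    Raises ValueError if some action of `action_space` has no transition,
    since every per-action set is assumed to be non-empty.
    """
    sets = defaultdict(list)
    for x, u, r, y in transitions:
        sets[u].append((x, r, y))
    missing = [u for u in action_space if not sets[u]]
    if missing:
        raise ValueError(f"no transition observed for actions {missing}")
    return dict(sets)


# Hypothetical batch with two actions (0 and 1) and scalar states.
batch = [(0.0, 0, 1.0, 0.5), (0.5, 1, 0.2, 0.1), (1.0, 0, 0.0, 1.5)]
sets = group_by_action(batch, action_space=(0, 1))
```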

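Lipschitz continuity: assumption about maximal variations

We assume that the system dynamics $f$ and the reward function $\rho$ are Lipschitz continuous: for all states $x, x'$ and every action $u$,

$\|f(x,u) - f(x',u)\| \le L_f \|x - x'\|$ and $|\rho(x,u) - \rho(x',u)| \le L_\rho \|x - x'\|$,

where $\|\cdot\|$ denotes the Euclidean norm over the state space.

We also assume that two constants $L_f$ and $L_\rho$ satisfying the above inequalities are known.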
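
A quick sanity check on the assumed constants can be sketched as follows: for the transitions of a single action, no valid Lipschitz constant can be smaller than the largest ratio of output distance to state distance observed in the sample. The function name and the tuple-based encoding of states are illustrative assumptions.

```python
import math
from itertools import combinations


def smallest_compatible_constant(states, outputs):
    """Smallest L such that ||out_i - out_j|| <= L * ||x_i - x_j|| for all sample pairs.

    `states` and `outputs` are sequences of tuples collected for ONE action.
    Pairs of identical states are skipped (they carry no slope information).
    """
    ratios = [math.dist(oi, oj) / math.dist(xi, xj)
              for (xi, oi), (xj, oj) in combinations(zip(states, outputs), 2)
              if math.dist(xi, xj) > 0]
    return max(ratios, default=0.0)


# Hypothetical sample: scalar states and rewards, wrapped in 1-tuples.
states = [(0.0,), (1.0,), (3.0,)]
rewards = [(0.0,), (2.0,), (3.0,)]
L_min = smallest_compatible_constant(states, rewards)
```

If the constant supplied for the reward function were below `L_min`, no Lipschitz continuous function with that constant could interpolate the data, so the set of compatible functions would be empty.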

Min max generalization

One can define the sets of Lipschitz continuous functions compatible with the data:

and the return associated with a pair of functions taken from those two sets:

Min max generalization

One can then define:



And the solution of the min max generalization problem can be defined as follows:
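
The formulas themselves are not reproduced in this text. As a sketch only, the min max generalization definitions can be written as below, where the batch is indexed as $\mathcal{F} = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$ and the set names $\mathcal{L}^{f}_{\mathcal{F}}$, $\mathcal{L}^{\rho}_{\mathcal{F}}$ are assumed notation rather than symbols confirmed by the slides.

```latex
\begin{align*}
\mathcal{L}^{f}_{\mathcal{F}} &= \bigl\{ f' : \mathcal{X}\times\mathcal{U}\to\mathcal{X} \;\bigm|\;
    f' \text{ is } L_f\text{-Lipschitz in } x,\ f'(x^l,u^l) = y^l \ \ \forall l \bigr\},\\
\mathcal{L}^{\rho}_{\mathcal{F}} &= \bigl\{ \rho' : \mathcal{X}\times\mathcal{U}\to\mathbb{R} \;\bigm|\;
    \rho' \text{ is } L_\rho\text{-Lipschitz in } x,\ \rho'(x^l,u^l) = r^l \ \ \forall l \bigr\},\\
J^{(u_0,\dots,u_{T-1})}_{(f',\rho')} &= \sum_{t=0}^{T-1} \rho'(x'_t,u_t),
    \qquad x'_0 = x_0,\quad x'_{t+1} = f'(x'_t,u_t),\\
L^{(u_0,\dots,u_{T-1})} &= \min_{(f',\rho') \in \mathcal{L}^{f}_{\mathcal{F}}\times\mathcal{L}^{\rho}_{\mathcal{F}}}
    J^{(u_0,\dots,u_{T-1})}_{(f',\rho')},\\
(u_0^*,\dots,u_{T-1}^*) &\in \operatorname*{arg\,max}_{(u_0,\dots,u_{T-1}) \in \mathcal{U}^T}
    L^{(u_0,\dots,u_{T-1})}.
\end{align*}
```

In words: the inner minimization returns the worst return any data-compatible system could produce for a given sequence of actions, and the outer maximization picks the sequence whose worst case is best.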

Reformulation

According to previous research [1], we know that computing the optimal bound for a given sequence of actions can be reformulated as follows:

[1] proposes a lower bound on the optimal bound (computed independently from this reformulation). Here, we directly target this problem in order to find a bound tighter than that of [1].

[1] "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Series: Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.

Small simplification

One can show that type (3.3) constraints are redundant:

We can deduce the solution for time t=0:
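
The closed form for the first stage is not shown in this text. A minimal sketch is given below, assuming the first-stage subproblem reduces to minimizing the reward at the initial state subject only to the Lipschitz constraints against the sampled rewards of the first action; under that assumption the minimum is the largest of the per-sample lower bounds. The function name and the `(x, r, y)` tuple layout are hypothetical.

```python
import math


def first_reward_lower_bound(x0, transitions_u0, L_rho):
    """Tightest Lipschitz lower bound on rho(x0, u0).

    `transitions_u0` holds the samples (x, r, y) observed for action u0.
    Each sample implies rho(x0, u0) >= r - L_rho * ||x - x0||; the smallest
    value satisfying every such constraint is their maximum.
    """
    return max(r - L_rho * math.dist(x, x0) for x, r, _y in transitions_u0)


# Hypothetical samples for action u0, with scalar states wrapped in 1-tuples.
samples_u0 = [((1.0,), 1.0, (1.5,)), ((0.5,), 0.4, (0.2,))]
r0_hat = first_reward_lower_bound((0.0,), samples_u0, L_rho=1.0)
```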

New problem

Complexity

One can show that such a problem is NP-hard.

We propose relaxation schemes of polynomial complexity.

We want those relaxation schemes to preserve the philosophy of the original problem, i.e., to provide lower bounds.

We propose two types of relaxations:
– The Intertwined Trust-Region (ITR) relaxation scheme
– The Lagrangian relaxation scheme

We show that those relaxations yield tighter bounds than the previous solution given in [1].

Relaxation schemes

(I) Intertwined trust-region

First approach: remove constraints until the problem becomes solvable in polynomial time

Only one constraint

We get the ITR problem:

A closed-form solution of this problem can be obtained

The ITR problem can be solved for any selection of constraints

One can thus define a maximal ITR bound:

(II) Lagrangian relaxation



Polynomial complexity

Tightness of the bounds

Comparison with the relaxation proposed in [1]:

ITR versus [1]:

Sketch of proof:
– Compute the ITR relaxation with the constraints used by the CGRL bound (the bound of [1])

Lagrangian relaxation versus ITR:

Sketch of proof:
– Strong duality holds for the Lagrangian relaxation of the ITR problem

Synthesis:

All these bounds converge to the actual return of sequences of actions when the dispersion of the batch of transitions decreases towards zero
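
To make the notion of dispersion concrete, the sketch below approximates it on a finite grid, assuming the dispersion is the largest distance from any state to its nearest sampled state (a common definition; the exact one used in the talk is not shown in this text). The function name and the grid-based approximation are illustrative choices.

```python
import math


def approx_dispersion(samples, grid):
    """Largest distance, over grid points, to the nearest sampled state.

    `samples` and `grid` are sequences of state tuples; the grid stands in
    for the continuous state space, so the result is an approximation.
    """
    return max(min(math.dist(g, s) for s in samples) for g in grid)


# Hypothetical 1-D state space [0, 1] discretized into 101 grid points.
grid = [(i / 100,) for i in range(101)]
sparse = approx_dispersion([(0.0,), (1.0,)], grid)
denser = approx_dispersion([(0.25,), (0.75,)], grid)
```

The second sample covers the interval more evenly and has a smaller dispersion, which is the regime in which the bounds are stated to approach the actual return.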

Illustration

Dynamics:

Reward function:

Initial state:

Decision space:

Grid:

100 samples of transitions drawn uniformly at random

Illustration: maximal bounds

[Figure panels: Grid; Empirical average over random samples]

Illustration: returns of sequences

[Figure panels: Grid; Empirical average over random samples]

Future work

Stochastic case

Computing policies

Exact solution

Infinite horizon

Min Max Generalization for Deterministic Batch Mode Reinforcement Learning: Relaxation Schemes. R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. SIAM Journal on Control and Optimization, Volume 51, Issue 5, pp. 3355-3385, 2013.

! Advertisement !

The French Workshop on Planning, Decision Making and Learning 2014 will be held in Liège, May 12-13, 2014. Hope to see you there!

http://sites.google.com/site/jfpda14/
