Batch mode reinforcement learning based on the synthesis of artificial trajectories

(French title: Apprentissage par renforcement batch fondé sur la reconstruction de trajectoires artificielles)

R. Fonteneau(1), Susan A. Murphy(2), Louis Wehenkel(1) and Damien Ernst(1)

(1) University of Liège, Belgium
(2) University of Michigan, USA

May 12th, 2014, JFPDA'14, Liège, Belgium

I’m happy to present this work here: a synthesis of 5 years of research at the University of Liège (in collaboration with the University of Michigan) in the field of batch mode reinforcement learning.

« Batch mode reinforcement learning based on the synthesis of artificial trajectories », R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Annals of Operations Research, 208 (1), pp 383-416, 2013.

Outline

● Batch Mode Reinforcement Learning
  − Batch Mode Reinforcement Learning
  − Objectives
  − Main Difficulties & Usual Approach
  − Remaining Challenges
● A New Approach: Synthesizing Artificial Trajectories
  − Formalization
  − Artificial Trajectories: What For?
● Conclusions

Batch Mode Reinforcement Learning

[Figure: p patients (rows 1 to p) followed over time steps 0, 1, ..., T. A batch collection of patient trajectories is available; the question is which treatment is 'optimal'.]

Objectives

● Main goal: Finding a "good" policy
● Many associated subgoals:
  − Evaluating the performance of a given policy
  − Computing performance guarantees
  − Computing safe policies
  − Choosing how to generate additional transitions
  − ...

Main Difficulties & Usual Approach

● Main difficulties of the batch mode setting:
  − Dynamics and reward functions are unknown (and not accessible to simulation)
  − The state space and/or the action space are large or continuous
  − The environment may be highly stochastic
● Usual approach:
  − To combine dynamic programming with function approximators (neural networks, regression trees, SVM, linear regression over basis functions, etc.)
  − Function approximators have two main roles:
    ● To offer a concise representation of the state-action value function for deriving value / policy iteration algorithms
    ● To generalize information contained in the finite sample
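As an illustration of the usual approach, here is a minimal sketch of finite-horizon fitted Q iteration, with linear regression over basis functions as the function approximator. Everything concrete in it is an assumption made for the example: the basis `[1, x, x^2]` per action, the toy one-dimensional system used to generate the batch, and the function names. It is not taken from the slides.

```python
import numpy as np

def features(x, u, n_actions):
    """Hypothetical basis: [1, x, x^2] in the block of the chosen action, zeros elsewhere."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u)
    base = np.stack([np.ones_like(x), x, x ** 2], axis=1)
    phi = np.zeros((x.shape[0], 3 * n_actions))
    for a in range(n_actions):
        mask = (u == a)
        phi[mask, 3 * a:3 * (a + 1)] = base[mask]
    return phi

def fitted_q_iteration(x, u, r, y, n_actions, horizon):
    """Fit Q_1, ..., Q_horizon on a batch of transitions (x, u, r, y); no discounting."""
    phi = features(x, u, n_actions)
    weights = []
    target = r.copy()  # Q_1 regresses on the immediate reward
    for _ in range(horizon):
        w, *_ = np.linalg.lstsq(phi, target, rcond=None)
        weights.append(w)
        # Dynamic programming backup: r + max_a Q_N(y, a), using the fitted model
        q_next = np.stack(
            [features(y, np.full(len(y), a), n_actions) @ w for a in range(n_actions)],
            axis=1,
        )
        target = r + q_next.max(axis=1)
    return weights

def greedy_action(w, x, n_actions):
    """Action maximizing the fitted Q-function at state x."""
    q = [features(np.array([x]), np.array([a]), n_actions)[0] @ w for a in range(n_actions)]
    return int(np.argmax(q))

# Toy batch (assumed system): action 1 moves the state by +0.1, action 0 by -0.1,
# and the reward penalizes the squared distance of the next state from the origin.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
u = rng.integers(0, 2, size=200)
y = x + np.where(u == 1, 0.1, -0.1)
r = -(y ** 2)
weights = fitted_q_iteration(x, u, r, y, n_actions=2, horizon=3)
```

The regression step is where the approximator plays both roles listed above: it stores the value function compactly (six weights here) and it generalizes from the 200 sampled transitions to unseen states.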

Remaining Challenges

● The black box nature of function approximators may have some unwanted effects:
  − hazardous generalization
  − difficulty in computing performance guarantees
  − inefficient use of optimal trajectories
● A New Approach: Synthesizing Artificial Trajectories

A New Approach: Synthesizing Artificial Trajectories

Formalization: Reinforcement learning

● System dynamics: [equation not recovered]
● Reward function: [equation not recovered]
● Performance of a policy: [equation not recovered]
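The equations of this slide are not available in the text. As a sketch only, the finite-horizon stochastic formulation commonly used in this line of work, and in the Annals of Operations Research paper cited above, can be written as follows. The symbols f (dynamics), ρ (reward function), p_W (disturbance distribution), h (policy) and T (horizon) are assumptions about the slide's notation.

```latex
\begin{align*}
x_{t+1} &= f(x_t, u_t, w_t), \qquad t = 0, \ldots, T-1, \quad w_t \sim p_{\mathcal{W}}(\cdot) \\
r_t &= \rho(x_t, u_t, w_t) \\
J^h(x_0) &= \mathbb{E}_{w_0, \ldots, w_{T-1} \sim p_{\mathcal{W}}(\cdot)}\left[ R^h(x_0) \right] \\
\text{where} \quad R^h(x_0) &= \sum_{t=0}^{T-1} \rho\bigl(x_t, h(t, x_t), w_t\bigr),
\qquad x_{t+1} = f\bigl(x_t, h(t, x_t), w_t\bigr)
\end{align*}
```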

Formalization: Batch mode reinforcement learning

● The system dynamics, reward function and disturbance probability distribution are unknown
● Instead, we have access to a sample of one-step system transitions: [equation not recovered]
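To make the data format concrete, the sketch below represents a batch as a list of one-step transitions (x, u, r, y): the state, the action taken, the observed reward and the observed successor state. The field names, the `Transition` type and the toy system in `toy_step` are assumptions for illustration; in the batch mode setting the learner only ever sees the resulting list, never the system that produced it.

```python
import random
from typing import List, NamedTuple

class Transition(NamedTuple):
    x: float  # state in which the action was taken
    u: int    # action taken
    r: float  # observed reward
    y: float  # observed successor state

def toy_step(x, u, rng):
    """Hypothetical stochastic system, used only to produce example data."""
    w = rng.gauss(0.0, 0.05)                # disturbance
    y = x + (0.1 if u == 1 else -0.1) + w   # dynamics
    r = -abs(y)                             # reward
    return r, y

def collect_batch(n, seed=0) -> List[Transition]:
    """Draw n independent one-step transitions from random states and actions."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        u = rng.choice([0, 1])
        r, y = toy_step(x, u, rng)
        batch.append(Transition(x, u, r, y))
    return batch

batch = collect_batch(100)
```

Note that the transitions need not come from complete trajectories: each element is a single step, which is what later allows them to be recombined.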

Formalization: Artificial trajectories

● Artificial trajectories are (ordered) sequences of elementary pieces of trajectories: [equation not recovered]
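One way to build such sequences, in the spirit of the model-free Monte Carlo (MFMC) estimator listed in the references, is to start from the initial state, repeatedly pick the unused transition closest to the current state-action pair, collect its reward and jump to its successor state. The sketch below is an illustration under stated assumptions, not the authors' implementation: scalar states and actions, an L1 distance in the state-action space, and a hypothetical function name `mfmc_estimate`.

```python
import numpy as np

def mfmc_estimate(batch, policy, x0, horizon, n_traj):
    """Average return of n_traj artificial trajectories built from one-step transitions.

    batch: rows (x, u, r, y) with scalar state and action.
    policy: function (t, x) -> u.
    Each transition is used at most once across all trajectories.
    """
    batch = np.asarray(batch, dtype=float)
    if n_traj * horizon > len(batch):
        raise ValueError("need at least n_traj * horizon transitions")
    unused = np.ones(len(batch), dtype=bool)
    returns, trajectories = [], []
    for _ in range(n_traj):
        x, total, indices = x0, 0.0, []
        for t in range(horizon):
            u = policy(t, x)
            # Distance to every sampled state-action pair (assumed metric: L1)
            dist = np.abs(batch[:, 0] - x) + np.abs(batch[:, 1] - u)
            dist[~unused] = np.inf
            l = int(np.argmin(dist))
            unused[l] = False
            indices.append(l)
            total += batch[l, 2]   # collect the reward of the selected piece
            x = batch[l, 3]        # continue from its successor state
        returns.append(total)
        trajectories.append(indices)
    return float(np.mean(returns)), trajectories

# Tiny hand-made batch: three transitions that chain exactly, plus one distant one.
demo_batch = [
    (0.0, 1, 1.0, 1.0),
    (1.0, 1, 2.0, 2.0),
    (2.0, 1, 3.0, 3.0),
    (5.0, 0, 100.0, 6.0),
]
always_one = lambda t, x: 1
estimate, trajectories = mfmc_estimate(demo_batch, always_one, x0=0.0, horizon=3, n_traj=1)
```

With this batch the single artificial trajectory uses transitions 0, 1 and 2 in order. Forbidding reuse keeps the trajectories disjoint, at the cost of selecting more distant transitions once the close ones are exhausted.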

Artificial Trajectories: What For?

● Artificial trajectories can help for:
  − Estimating the performances of policies
  − Computing performance guarantees
  − Computing safe policies
  − Choosing how to generate additional transitions

Conclusions

[Diagram: overview of the results, grouped by setting.
Stochastic setting: MFMC, estimator of the expected return (bias / variance analysis, illustration); estimator of the VaR.
Deterministic setting: continuous action space (bounds on the return, convergence); finite action space (CGRL, convergence + additional properties, illustration); sampling strategy (illustration).]

References

"Batch mode reinforcement learning based on the synthesis of artificial trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Annals of Operations Research, 208 (1), 383-416, 2013.

"Generating informative trajectories by using bounds on the return of control policies". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010), 2-page highlight paper, Chia Laguna, Sardinia, Italy, May 16, 2010.

"Model-free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP 9, pp 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.

"A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.

"Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March-2 April, 2009.
 
Acknowledgements to the F.R.S.-FNRS for its financial support.
