R. Fonteneau (1),(2)

Joint work with Susan A. Murphy (3), Louis Wehenkel (2) and Damien Ernst (2)

(1) Inria Lille – Nord Europe, France
(2) University of Liège, Belgium
(3) University of Michigan, USA

December 10th, 2012
CMS Winter Meeting – Montreal, Canada

Outline

● Batch Mode Reinforcement Learning
  – Reinforcement Learning
  – Batch Mode Reinforcement Learning
  – Objectives
  – Main Difficulties & Usual Approach
  – Remaining Challenges

● A New Approach: Synthesizing Artificial Trajectories
  – Formalization
  – Artificial Trajectories: What For?

● Estimating the Performances of Policies
  – Model-free Monte Carlo Estimation
  – The MFMC Algorithm
  – Theoretical Analysis
  – Experimental Illustration

● Conclusions

Batch Mode Reinforcement Learning

Reinforcement Learning

[Figure: agent/environment loop (the agent sends actions; the environment returns observations and rewards), with examples of rewards]

● Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment

Batch Mode Reinforcement Learning

● All the available information is contained in a batch collection of data
● Batch mode RL aims at computing a (near-)optimal policy from this collection of data

[Figure: agent/environment loop (actions; observations, rewards) → finite collection of trajectories of the agent → batch mode RL → near-optimal decision strategy]

● Examples of BMRL problems: dynamic treatment regimes (inferred from clinical data), marketing optimization (based on customer histories), finance, etc.

Batch Mode Reinforcement Learning

[Figure: batch collection of trajectories of patients 1 … p over time 0, 1, …, T; question posed: 'optimal' treatment?]

Objectives

● Main goal: Finding a "good" policy
● Many associated subgoals:
  – Evaluating the performance of a given policy
  – Computing performance guarantees
  – Computing safe policies
  – Choosing how to generate additional transitions
  – ...

Main Difficulties & Usual Approach

● Main difficulties of the batch mode setting:
  – Dynamics and reward functions are unknown (and not accessible to simulation)
  – The state space and/or the action space are large or continuous
  – The environment may be highly stochastic
● Usual approach:
  – Combine dynamic programming with function approximators (neural networks, regression trees, SVM, linear regression over basis functions, etc.)
  – Function approximators have two main roles:
    ● To offer a concise representation of the state-action value function for deriving value / policy iteration algorithms
    ● To generalize information contained in the finite sample

Remaining Challenges

● The black-box nature of function approximators may have some unwanted effects:
  – hazardous generalization
  – difficulty in computing performance guarantees
  – inefficient use of optimal trajectories

A New Approach: Synthesizing Artificial Trajectories

Formalization
Reinforcement learning

● System dynamics:
● Reward function:
● Performance of a policy
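The three ingredients above can be written out as a sketch, assuming the usual finite-horizon notation. The symbols f (dynamics), ρ (reward) and w_t (disturbance) are labels assumed here; h, T, x_0 and p_W are the symbols used elsewhere in these slides, and the time-dependent form h(t, x) of the policy is also an assumption.

```latex
% Assumed notation: f, \rho and w_t are labels chosen for this sketch.
\begin{align*}
  x_{t+1} &= f(x_t, u_t, w_t), \qquad w_t \sim p_W(\cdot), \qquad t = 0, \ldots, T-1 \\
  r_t &= \rho(x_t, u_t, w_t) \\
  J^h(x_0) &= \mathbb{E}\left[ R^h(x_0) \right],
  \qquad R^h(x_0) = \sum_{t=0}^{T-1} \rho\big(x_t, h(t, x_t), w_t\big)
\end{align*}
```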

Formalization
Batch mode reinforcement learning

● The system dynamics, reward function and disturbance probability distribution are unknown
● Instead, we have access to a sample of one-step system transitions:

Formalization
Artificial trajectories

● Artificial trajectories are (ordered) sequences of elementary pieces of trajectories:
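One way to picture these objects in code is sketched below. It assumes that each one-step transition is stored as a tuple (state x, action u, reward r, successor state y) and that an artificial trajectory is an ordered list of indices into the sample; the names `Transition` and `trajectory_return` and the numbers are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Transition(NamedTuple):
    """One-step system transition (field names are assumptions)."""
    x: float  # state in which the action was taken
    u: float  # action taken
    r: float  # reward observed
    y: float  # successor state reached


def trajectory_return(sample: Sequence[Transition], indices: List[int]) -> float:
    """Sum of the rewards of the transitions picked, in order, by `indices`."""
    return sum(sample[l].r for l in indices)


# A tiny made-up sample of n = 3 transitions.
sample = [
    Transition(x=0.0, u=1.0, r=0.5, y=1.0),
    Transition(x=1.0, u=1.0, r=0.25, y=2.0),
    Transition(x=2.0, u=-1.0, r=1.0, y=1.0),
]

# An artificial trajectory of length T = 3: an ordered sequence of sample indices.
artificial_trajectory = [0, 1, 2]
total = trajectory_return(sample, artificial_trajectory)
```

Storing indices rather than copies of the transitions makes it easy to check that a given transition is not reused.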

Artificial Trajectories: What For?

● Artificial trajectories can help with:
  – Estimating the performances of policies
  – Computing performance guarantees
  – Computing safe policies
  – Choosing how to generate additional transitions

Estimating the Performances of Policies

Model-free Monte Carlo Estimation

● If the system dynamics and the reward function were accessible to simulation, then Monte Carlo (MC) estimation would allow estimating the performance of h

MODEL OR SIMULATOR REQUIRED!
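To make the simulator requirement concrete, here is a minimal sketch of plain Monte Carlo estimation. The callables `f`, `rho`, `h` and `sample_w`, and the toy system in the usage lines, are hypothetical stand-ins and not the benchmark of these slides.

```python
import random
from typing import Callable


def mc_estimate(
    f: Callable[[float, float, float], float],    # simulator of the dynamics (assumed available)
    rho: Callable[[float, float, float], float],  # simulator of the reward (assumed available)
    h: Callable[[int, float], float],             # policy: (time, state) -> action
    sample_w: Callable[[random.Random], float],   # draws one disturbance
    x0: float,
    T: int,
    p: int,
    seed: int = 0,
) -> float:
    """Average return of p simulated trajectories of length T started from x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(p):
        x, ret = x0, 0.0
        for t in range(T):
            u = h(t, x)
            w = sample_w(rng)
            ret += rho(x, u, w)  # reward collected at time t
            x = f(x, u, w)       # the simulator is called here
        total += ret
    return total / p


# Toy usage: noisy integrator, reward penalizing distance to the origin.
estimate = mc_estimate(
    f=lambda x, u, w: x + u + w,
    rho=lambda x, u, w: -abs(x),
    h=lambda t, x: -0.5 * x,
    sample_w=lambda rng: rng.uniform(-0.1, 0.1),
    x0=-0.5,
    T=15,
    p=10,
)
```

Every call to `f` and `rho` is a call to the model, which is exactly what the batch mode setting does not provide.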

Model-free Monte Carlo Estimation

● We propose an approach that mimics MC estimation by rebuilding p artificial trajectories from one-step system transitions
● These artificial trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h; each one-step transition is used at most once
● We average the cumulated returns over the p artificial trajectories to obtain the Model-free Monte Carlo (MFMC) estimator of the expected return of h
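The construction described above can be sketched as a greedy nearest-neighbour procedure. This is an illustrative reading, not the authors' reference implementation: it assumes scalar states and actions, transitions stored as tuples (x, u, r, y), an L1 distance standing in for ∆, and ties broken by lowest index; all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (x, u, r, y), assumed layout


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for the metric on state-action pairs (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mfmc_estimate(
    sample: Sequence[Transition],
    h: Callable[[int, float], float],
    x0: float,
    T: int,
    p: int,
    dist: Callable[[Sequence[float], Sequence[float]], float] = l1_distance,
) -> Tuple[float, List[float]]:
    """Rebuild p artificial trajectories of length T and average their returns."""
    if len(sample) < p * T:
        raise ValueError("need at least p * T one-step transitions")
    available = set(range(len(sample)))  # each transition is used at most once
    returns = []
    for _ in range(p):
        x, ret = x0, 0.0
        for t in range(T):
            u = h(t, x)
            # Closest unused transition to (x, u); ties go to the lowest index.
            l = min(available, key=lambda j: (dist(sample[j][:2], (x, u)), j))
            available.remove(l)
            ret += sample[l][2]  # collect its reward
            x = sample[l][3]     # continue from its successor state
        returns.append(ret)
    return sum(returns) / p, returns
```

Returning the individual returns alongside their average keeps the p artificial trajectories available for other statistics.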

The MFMC algorithm

[Figure sequence: step-by-step construction of the artificial trajectories on an example with T = 3, p = 2, n = 8]

Theoretical Analysis
Assumptions

● Lipschitz continuity assumptions:
● Distance metric ∆
● k-sparsity, defined through the distance of (x,u) to its k-th nearest neighbor (using the distance ∆) in the sample
● The k-sparsity can be seen as the smallest radius such that all ∆-balls in X×U contain at least k elements from the sample
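As a rough numerical reading of this definition, the sketch below computes the k-th nearest-neighbor distance of a query point and takes its maximum over a set of query points. Assumptions: the supremum over X×U is replaced by a maximum over a finite list of queries supplied by the caller (so the result is only a lower estimate of the k-sparsity), the L1 distance stands in for ∆, and all names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]  # a state-action pair (x, u), scalar for simplicity


def l1_distance(a: Point, b: Point) -> float:
    """Stand-in for the metric on state-action pairs (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def kth_nn_distance(query: Point, points: Sequence[Point], k: int,
                    dist: Callable[[Point, Point], float] = l1_distance) -> float:
    """Distance from `query` to its k-th nearest neighbor among `points`."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    return sorted(dist(query, pt) for pt in points)[k - 1]


def k_sparsity_on_queries(points: Sequence[Point], queries: Sequence[Point], k: int,
                          dist: Callable[[Point, Point], float] = l1_distance) -> float:
    """Largest k-th nearest-neighbor distance over a finite set of query points."""
    return max(kth_nn_distance(q, points, k, dist) for q in queries)
```

A denser query grid tightens the estimate, at a cost that grows with the number of queries times the sample size.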

Theoretical Analysis
Theoretical results

● Expected value of the MFMC estimator (theorem)
● Variance of the MFMC estimator (theorem)

Experimental Illustration
Benchmark

● Dynamics:
● Reward function:
● Policy to evaluate:
● Other information: pW(.) is uniform

Experimental Illustration
Influence of n

● Simulations for p = 10, n = 100 … 10 000, uniform grid, T = 15, x0 = -0.5

[Figure: Model-free Monte Carlo estimator (n = 100 … 10 000, p = 10) next to the Monte Carlo estimator (p = 10)]

Experimental Illustration
Influence of p

● Simulations for p = 1 … 100, n = 10 000, uniform grid, T = 15, x0 = -0.5

[Figure: Model-free Monte Carlo estimator (p = 1 … 100, n = 10 000) next to the Monte Carlo estimator (p = 1 … 100)]

Experimental Illustration
MFMC vs FQI-PE

● Comparison with the FQI-PE algorithm using k-NN, n = 100, T = 5

Conclusions

● Stochastic setting
  – MFMC: estimator of the expected return
  – Bias / variance analysis
  – Illustration
  – Estimator of the VaR
● Deterministic setting
  – Continuous action space
  – Finite action space
  – Bounds on the return
  – CGRL
  – Convergence + additional properties
  – Sampling strategy
  – Convergence
  – Illustrations

References

● "Batch mode reinforcement learning based on the synthesis of artificial trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. To appear in Annals of Operations Research, 2012.
● "Generating informative trajectories by using bounds on the return of control policies". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010), 2-page highlight paper, Chia Laguna, Sardinia, Italy, May 16, 2010.
● "Model-free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP 9, pp. 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.
● "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
● "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, March 30-April 2, 2009.

Acknowledgements to F.R.S.-FNRS for its financial support.

Appendix

Estimating the Performances of Policies
Risk-sensitive criterion

● Consider again the p artificial trajectories that were rebuilt by the MFMC estimator
● The Value-at-Risk of the policy h can be straightforwardly estimated from them
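A minimal sketch of one way to do this, assuming the common empirical reading of Value-at-Risk as a lower quantile of the p returns of the artificial trajectories. The level name `c`, the function name and the quantile convention are assumptions made for illustration and are not taken from the slides.

```python
import math
from typing import Sequence


def empirical_var(returns: Sequence[float], c: float) -> float:
    """Smallest return r such that at least a fraction c of the returns are <= r."""
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    ordered = sorted(returns)
    rank = max(1, math.ceil(c * len(ordered)))  # 1-based rank of the quantile
    return ordered[rank - 1]


# Usage on made-up returns of p = 4 artificial trajectories.
returns = [3.0, 1.0, 4.0, 2.0]
var_at_half = empirical_var(returns, 0.5)
```

With small p the empirical quantile is coarse, since it can only take one of the p observed return values.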

Deterministic Case: Computing Bounds
Bounds from a Single Trajectory

● Proposition: given an artificial trajectory, bounds on the return can be computed from it

Deterministic Case: Computing Bounds
Maximal Bounds

● Maximal lower and upper bounds

Deterministic Case: Computing Bounds
Tightness of Maximal Bounds

● Proposition:

Inferring Safe Policies
From Lower Bounds to Cautious Policies

● Consider the set of open-loop policies:
● For such policies, bounds can be computed in a similar way
● We can then search for a specific policy for which the associated lower bound is maximized:
● An O(T n²) algorithm for doing this: the CGRL algorithm (Cautious approach to Generalization in RL)

Inferring Safe Policies
Convergence

● Theorem

Inferring Safe Policies
Experimental Results

● The puddle world benchmark

[Figure: CGRL compared with FQI (Fitted Q Iteration) in two settings: the state space is uniformly covered by the sample; information about the puddle area is removed]

Inferring Safe Policies
Bonus

● Theorem

Sampling Strategies
An Artificial Trajectories Viewpoint

● Given a sample of system transitions, how can we determine where to sample additional transitions?
● We define the set of candidate optimal policies:
● A transition is said to be compatible with the sample if it satisfies a compatibility condition; we then consider the set of all such compatible transitions

Sampling Strategies
An Artificial Trajectories Viewpoint

● Iterative scheme:
● Conjecture:

Sampling Strategies
Illustration

● Action space:
● Dynamics and reward function:
● Horizon:
● Initial state:
● Total number of policies:
● Number of transitions needed for discriminating:

Connexion to Classic Batch Mode RL
Towards a New Paradigm for Batch Mode RL

● FQI (evaluation mode) with k-NN:

[Figure: tree of transition indices branching k ways at each level, from l_1, …, l_k at the first level down to leaves l_{1,1,…,1}, …, l_{k,k,…,k}]

Connexion to Classic Batch Mode RL
Towards a New Paradigm for Batch Mode RL

● The k-NN FQI-PE algorithm:
● The k-NN FQI-PE estimator:
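For orientation, here is a sketch of policy evaluation by fitted Q iteration with a k-nearest-neighbor regressor, written as a backward recursion over the time steps. It is a generic reading of the technique under stated assumptions, not a transcription of the slide: scalar states and actions, transitions stored as tuples (x, u, r, y), an L1 distance, ties broken by lowest index, and hypothetical names throughout.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (x, u, r, y), assumed layout


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for the metric on state-action pairs (an assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def knn_indices(sample: Sequence[Transition], x: float, u: float, k: int,
                dist=l1_distance) -> List[int]:
    """Indices of the k transitions whose (x, u) is closest to the query."""
    order = sorted(range(len(sample)),
                   key=lambda j: (dist(sample[j][:2], (x, u)), j))
    return order[:k]


def knn_fqi_pe(sample: Sequence[Transition], h: Callable[[int, float], float],
               x0: float, T: int, k: int, dist=l1_distance) -> float:
    """Estimate the T-step return of policy h from x0 with k-NN averaging."""
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and the sample size")
    # w[l]: estimated return of following h from successor state y^l onwards.
    w = [0.0] * len(sample)  # at time T nothing is left to collect
    for t in range(T - 1, 0, -1):
        new_w = []
        for (_, _, _, y) in sample:
            nbrs = knn_indices(sample, y, h(t, y), k, dist)
            new_w.append(sum(sample[j][2] + w[j] for j in nbrs) / k)
        w = new_w
    nbrs = knn_indices(sample, x0, h(0, x0), k, dist)
    return sum(sample[j][2] + w[j] for j in nbrs) / k
```

Unrolling the recursion from x0 yields a tree of neighbor indices that branches k ways at each of the T levels, and, unlike in the MFMC construction, the same transition may appear many times in that tree.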