University of Liège, Belgium

Contributions to Batch Mode Reinforcement Learning
PhD Defense, Raphael Fonteneau, February 24th, 2011

Many thanks to the members of the jury:
Dr. Rémi Munos, INRIA, France
Prof. Susan A. Murphy, University of Michigan, USA
Dr. Michèle Sebag, INRIA, France
Dr. Damien Ernst, University of Liège (advisor)
Prof. Quentin Louveaux, University of Liège (president of the jury)
Prof. Rodolphe Sépulchre, University of Liège
Prof. Louis Wehenkel, University of Liège (co-advisor)

Introduction

Reinforcement Learning
[Figure: agent-environment interaction loop (actions; observations, rewards), with examples of rewards]

Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment

Batch Mode Reinforcement Learning
● All the available information is contained in a batch collection of data
● Batch mode RL aims at computing a (near-)optimal policy from this collection of data
[Figure: batch mode RL maps a finite collection of trajectories of the agent (actions, observations, rewards) to a (near-)optimal policy]

Dynamic Treatment Regimes
[Figure: p patients followed over time stages 0, 1, ..., T; from a batch collection of trajectories of patients, one seeks an 'optimal' treatment]


Challenges
● Batch mode RL appears to be a very promising approach for computing DTRs

However:
● Medical applications expect performance guarantees, which are usually not provided by batch mode RL algorithms
● There is almost no technique for generating informative batch collections of data
● The design of DTRs has to take into account that, to be convenient in practice, treatments should be based on a limited number of clinical indicators
● Clinical data gathered from experimental protocols may be highly noisy or incomplete
● Confounding issues and partial observability occur frequently when dealing with specific types of chronic-like diseases (e.g., psychotic diseases)
● Preference elicitation: how to evaluate the well-being of a patient?
● ...

Contributions
✔ A new approach for computing bounds on the performance of control policies in batch mode RL
✔ A min max approach to generalization in batch mode RL, and a new batch mode RL algorithm for inferring control policies that behave cautiously in adversarial environments
✔ Two new sampling strategies for generating informative collections of data
✔ A new batch mode estimator of the expected performance of control policies in stochastic environments
✔ A variable selection technique for batch mode RL, for computing control policies based on smaller subsets of variables

Common technical difficulties for all these contributions:
● The state and/or action space(s) is (are) continuous
● The only information available on the system dynamics and the reward function is given in the form of a (finite) set of one-step system transitions (a minimal data-structure sketch follows)
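To make this common setting concrete, here is a minimal, hypothetical Python sketch of the kind of object all contributions operate on: a batch collection of one-step system transitions (x, u, r, y). The class and function names are illustrative and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Transition:
    """One-step system transition (x, u, r, y)."""
    x: np.ndarray  # state at time t
    u: np.ndarray  # action taken at time t
    r: float       # instantaneous reward observed for (x, u)
    y: np.ndarray  # successor state reached from (x, u)

# A batch collection of data: a finite, unordered set of one-step transitions.
Sample = List[Transition]

def make_sample(raw: List[Tuple[np.ndarray, np.ndarray, float, np.ndarray]]) -> Sample:
    """Wrap raw (x, u, r, y) tuples into a batch sample."""
    return [Transition(x, u, r, y) for (x, u, r, y) in raw]
```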

First contribution: Bounds on performance

Bounds on performance
● We consider a deterministic discrete-time system whose dynamics over T stages is given by the time-invariant equation
    x_{t+1} = f(x_t, u_t),   t = 0, 1, ..., T-1,
where all states x_t lie in a normed state space X and all actions u_t lie in a normed action space U.
● The transition from time t to t+1 is associated with an instantaneous reward
    r_t = ρ(x_t, u_t).
● We consider a deterministic time-varying policy h that selects the action u_t = h(t, x_t).

Bounds on performance
● The return over T stages of the policy h when starting from an initial state x_0 is given by
    J^h_T(x_0) = Σ_{t=0}^{T-1} ρ(x_t, h(t, x_t)),   with x_{t+1} = f(x_t, h(t, x_t))
(a simulation sketch of this quantity, for the case where f and ρ are known, follows below).
● The system dynamics f and the reward function ρ are unknown.
● They are replaced by a sample of n one-step system transitions
    F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n},
where r^l = ρ(x^l, u^l) and y^l = f(x^l, u^l).
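For reference, if f and ρ were available, the T-stage return of h could be computed by simply rolling the system forward; it is precisely this access that the batch setting forbids. A minimal Python sketch under that assumption (all names are illustrative):

```python
import numpy as np
from typing import Callable

def return_of_policy(f: Callable, rho: Callable, h: Callable,
                     x0: np.ndarray, T: int) -> float:
    """T-stage return of a deterministic time-varying policy h from x0,
    assuming the dynamics f and the reward function rho can be evaluated."""
    x, total = x0, 0.0
    for t in range(T):
        u = h(t, x)          # action chosen by the policy at time t
        total += rho(x, u)   # accumulate the instantaneous reward
        x = f(x, u)          # deterministic transition to the next state
    return total
```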

Bounds on performance: Problem
● Since the system dynamics and the reward function are unknown, we cannot compute the return J^h_T(x_0) of h exactly.
● How can we compute bounds on this return?

Bounds on performance: Assumptions
● The system dynamics f, the reward function ρ, and the policy h are Lipschitz continuous: there exist constants L_f, L_ρ and L_h such that, for all x, x' in X and u, u' in U,
    ‖f(x, u) − f(x', u')‖ ≤ L_f (‖x − x'‖ + ‖u − u'‖),
    |ρ(x, u) − ρ(x', u')| ≤ L_ρ (‖x − x'‖ + ‖u − u'‖),
    ‖h(t, x) − h(t, x')‖ ≤ L_h ‖x − x'‖   for all t.
● Three constants L_f, L_ρ and L_h satisfying the above inequalities are known.

Bounds on performance
● Consider a sequence of T one-step system transitions [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T-1} taken from the sample F_n.
[Figure: illustration of such a sequence of transitions]

Bounds on performance
Theorem (Lower bound computed from a sequence of transitions)
Any such sequence of T transitions yields a computable lower bound on J^h_T(x_0): the sum of the rewards r^{l_t} along the sequence, minus penalty terms proportional to the Lipschitz constants and to the distances between the end of each used transition and the start of the next (a sketch of this computation follows below).
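As an illustration, here is a hedged Python sketch of such a bound: it accumulates the rewards of the chosen transitions and subtracts, at each step, a Lipschitz-based penalty proportional to the gap between the rebuilt trajectory and the transition being used. The exact form of the penalty constant lipschitz_constant_Q below is an assumption meant to mirror the slide's bound, not a verbatim transcription of it; transitions are plain (x, u, r, y) tuples of NumPy arrays and floats.

```python
import numpy as np
from typing import Callable, List

def lipschitz_constant_Q(N: int, Lf: float, Lrho: float, Lh: float) -> float:
    """Assumed Lipschitz constant of the N-stage value function:
    Lrho * sum_{i=0}^{N-1} (Lf * (1 + Lh))**i."""
    return Lrho * sum((Lf * (1.0 + Lh)) ** i for i in range(N))

def lower_bound_from_sequence(seq: List[tuple], h: Callable, x0: np.ndarray,
                              Lf: float, Lrho: float, Lh: float) -> float:
    """Lower bound on the T-stage return of h computed from a sequence of
    T one-step transitions (illustrative sketch)."""
    T = len(seq)
    bound, prev_end = 0.0, x0          # the rebuilt trajectory starts at x0
    for t, (xl, ul, rl, yl) in enumerate(seq):
        # Gap between the rebuilt trajectory and the transition being used.
        delta = (np.linalg.norm(xl - prev_end)
                 + np.linalg.norm(ul - h(t, prev_end)))
        bound += rl - lipschitz_constant_Q(T - t, Lf, Lrho, Lh) * delta
        prev_end = yl                  # continue from the transition's end state
    return bound
```

The maximal lower bound of the next slide is then the maximum of this quantity over all length-T sequences of transitions drawn from the sample.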

Bounds on performance
● Let us define the maximal lower bound as the maximum of the previous bound over all sequences of T transitions taken from the sample F_n.
● The theorem is still valid for any sequence of transitions, a fortiori for a sequence maximizing the previous bound.

Bounds on performance
Theorem (Tightness of maximal lower bound)
Assume that every state-action pair of X × U lies within some distance α of a transition of the sample, and let α* be the smallest constant satisfying this property (this parameter is called the sparsity of the sample in the following). Then the gap between the return J^h_T(x_0) and the maximal lower bound is bounded by a constant (depending on the Lipschitz constants and T) times the sparsity α*.

Second contribution: Min max generalization

Min max generalization
● The action space U is assumed to be finite.
● The Lipschitz continuity assumptions are expressed as follows: for all x, x' in X and every action u in U,
    ‖f(x, u) − f(x', u)‖ ≤ L_f ‖x − x'‖,
    |ρ(x, u) − ρ(x', u)| ≤ L_ρ ‖x − x'‖,
and two constants L_f and L_ρ satisfying the above inequalities are known.
● The system dynamics and the reward function are still unknown and are replaced by a sample of n one-step system transitions F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}.

Min max generalization
Definition (Compatible environments)
We define the sets of functions that are compatible with the sample of transitions and the Lipschitz constants: pairs (f', ρ') that satisfy the Lipschitz inequalities above with constants L_f and L_ρ, and that agree with every transition of the sample, i.e., f'(x^l, u^l) = y^l and ρ'(x^l, u^l) = r^l for l = 1, ..., n.

Min max generalization: Problem
● We want to evaluate, for a given sequence of actions (u_0, ..., u_{T-1}) and a given initial state x_0, the worst possible return over all compatible environments (f', ρ').
● One could then derive a "max min" solution of the optimal control problem by searching for a sequence of actions maximizing this worst-case return.

Min max generalization
Theorem (Reformulation)
The worst-case return over compatible environments can be reformulated as a finite-dimensional optimization problem over candidate states and rewards along the trajectory, subject to constraints derived from the Lipschitz continuity assumptions and from the sample of transitions.

Min max generalization
● The previous optimization problem is non-convex and NP-hard; its resolution is left for future work.
● We propose to adapt the technique presented in the first contribution to compute a lower bound on the worst-case return.
● We propose an algorithm called CGRL (Cautious approach to Generalization in RL) that derives a control policy maximizing this lower bound (a brute-force sketch of the principle follows below).
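The following hedged Python sketch conveys the CGRL principle for a finite action space: enumerate open-loop action sequences, score each one by the best lower bound achievable with sample transitions that use the prescribed actions, and keep the best-scoring sequence. The penalty constant is the same assumption as in the earlier bound sketch (without the policy term, since actions are fixed), and the doubly nested enumeration is for illustration only; it is exponential and is not the algorithm actually used in the thesis.

```python
import itertools
import numpy as np
from typing import List, Sequence, Tuple

def cgrl_brute_force(sample: List[tuple], actions: Sequence, x0: np.ndarray,
                     T: int, Lf: float, Lrho: float) -> Tuple[tuple, float]:
    """Return an open-loop action sequence maximizing a lower bound on its
    worst-case return (illustrative brute force, not the thesis algorithm)."""
    def LQ(N: int) -> float:
        # Assumed Lipschitz constant of the N-stage value function.
        return Lrho * sum(Lf ** i for i in range(N))

    best_seq, best_bound = None, -np.inf
    for useq in itertools.product(actions, repeat=T):
        # Candidate transitions at each step must use the prescribed action.
        candidates = [[tr for tr in sample if np.all(tr[1] == u)] for u in useq]
        bound_for_useq = -np.inf
        for trs in itertools.product(*candidates):
            prev_end, b = x0, 0.0
            for t, (xl, ul, rl, yl) in enumerate(trs):
                b += rl - LQ(T - t) * np.linalg.norm(xl - prev_end)
                prev_end = yl
            bound_for_useq = max(bound_for_useq, b)
        if bound_for_useq > best_bound:
            best_seq, best_bound = useq, bound_for_useq
    return best_seq, best_bound
```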

Min max generalization
Theorem (Convergence of the CGRL solution)
When the sparsity of the sample of system transitions decreases below a particular threshold, the sequence of actions computed by the CGRL algorithm is optimal with respect to the actual environment.

Min max generalization: Illustration
● The puddle world benchmark

Min max generalization: CGRL vs. FQI (Fitted Q Iteration)
[Figure: trajectories obtained by CGRL and FQI when the state space is uniformly covered by the sample]
[Figure: trajectories obtained by CGRL and FQI when the information about the puddle area is removed from the sample]

Third contribution: Sampling strategies

Sampling strategies: Problem
● Given a sample of system transitions F_n, how could one determine where to sample additional transitions?

Sampling strategies: Using bounds
● A possible approach is to use bounds.
● An upper bound can be computed in the same way as the lower bound detailed in the previous contribution.
● We have proposed an approach that samples additional transitions so as to increase the tightness of the bounds.

However:
● This strategy requires solving an intractable optimization problem.
● We propose another, tractable approach.

Sampling strategies: Falsification-based sampling strategy
● We assume that we have access to a predictive model PM of the environment and to a batch mode RL algorithm BMRL.
● Using the sample of already collected transitions, we first compute a control policy.
● We uniformly draw a state-action point (x, u) and compute a predicted transition using PM.
● We add the predicted transition to the current sample and compute a predicted control policy.
● If the predicted control policy falsifies the current control policy, then we sample a new (real) transition at (x, u); otherwise, we iterate with a new state-action point (x', u'). A sketch of this loop is given below.
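A minimal Python sketch of the falsification loop just described. PM and BMRL are passed in as callables, transitions are plain (x, u, r, y) tuples, and the policy-comparison test (checking whether the two policies disagree on at least one probe state) is an assumption of this sketch, since the slides do not fix how 'falsification' is tested.

```python
import numpy as np
from typing import Callable, List

def falsification_based_sampling(sample: List[tuple], bmrl: Callable,
                                 pm: Callable, sample_real_transition: Callable,
                                 draw_state_action: Callable,
                                 probe_points: List[tuple], n_new: int,
                                 max_tries: int = 1000) -> List[tuple]:
    """Grow the batch sample with n_new real transitions, sampled where a
    predicted transition would change ('falsify') the current policy."""
    for _ in range(n_new):
        policy = bmrl(sample)                   # policy from collected data
        for _ in range(max_tries):
            x, u = draw_state_action()          # uniformly drawn candidate
            r_hat, y_hat = pm(sample, x, u)     # predicted reward / successor
            predicted_policy = bmrl(sample + [(x, u, r_hat, y_hat)])
            # Falsified if the policies disagree on at least one probe (t, state).
            if any(not np.array_equal(policy(t, s), predicted_policy(t, s))
                   for (t, s) in probe_points):
                sample = sample + [sample_real_transition(x, u)]
                break
    return sample
```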

Sampling strategies: Illustration
● The car-on-the-hill benchmark
● PM: nearest-neighbor algorithm
● BMRL: nearest-neighbor model-learning RL algorithm
● We generate 50 databases of 1000 system transitions
● We evaluate the performance of the inferred control policies on the real system

Sampling strategies: Illustration
● Performance analysis: 50 runs of the falsification-based strategy (blue) are compared with 50 uniform sampling runs (red)

Sampling strategies: Illustration
● Distribution of the returns of the control policies at the end of the sampling process

Sampling strategies: Illustration
● Graphical representation of typical runs: falsification-based sampling strategy vs. uniform sampling strategy

Fourth contribution: Model-free Monte Carlo estimation

Model-free Monte Carlo estimation
● We assume a stochastic environment: x_{t+1} = f(x_t, u_t, w_t) and r_t = ρ(x_t, u_t, w_t), where the disturbances w_t are drawn i.i.d. from a distribution p_W(·) over a disturbance space W.
● A control policy h is given: u_t = h(t, x_t).
● The performance of h is evaluated through its expected return:
    J^h(x_0) = E_{w_0, ..., w_{T-1} ~ p_W(·)} [ Σ_{t=0}^{T-1} ρ(x_t, h(t, x_t), w_t) ].

Model-free Monte Carlo estimation: Problem
● Given a (noisy) sample of system transitions F_n = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}, how could one estimate the expected return of h?

Model-free Monte Carlo estimation
● If the system dynamics and the reward function were accessible to simulation, then Monte Carlo estimation would allow estimating the performance of h.
● We propose an approach that mimics Monte Carlo (MC) estimation by rebuilding p artificial trajectories from one-step system transitions.
● These artificial trajectories are built so as to minimize the discrepancy (measured with a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h; each one-step transition is used at most once.
● We average the cumulated returns over the p artificial trajectories to obtain the Model-free Monte Carlo (MFMC) estimator of the expected return of h. A sketch of this procedure is given below.
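A hedged Python sketch of the estimator as just described: each of the p artificial trajectories is grown step by step by picking the not-yet-used transition whose (state, action) pair is closest, in the metric ∆, to the current rebuilt state and the action h would take there; the p cumulated returns are then averaged. The greedy step-by-step selection and the Euclidean form of ∆ are assumptions of this sketch, and it requires n ≥ p·T transitions.

```python
import numpy as np
from typing import Callable, List

def mfmc_estimator(sample: List[tuple], h: Callable, x0: np.ndarray,
                   T: int, p: int) -> float:
    """Model-free Monte Carlo estimate of the expected return of h from x0,
    rebuilt from one-step transitions (x, u, r, y); each transition is used
    at most once (illustrative sketch, needs len(sample) >= p * T)."""
    def delta(x, u, xl, ul):
        # Distance metric on state-action pairs.
        return np.linalg.norm(x - xl) + np.linalg.norm(u - ul)

    used = [False] * len(sample)
    returns = []
    for _ in range(p):                     # rebuild p artificial trajectories
        x, cumulated = x0, 0.0
        for t in range(T):
            u = h(t, x)                    # action the policy would take here
            # Closest unused transition to the current (state, action) pair.
            l = min((i for i in range(len(sample)) if not used[i]),
                    key=lambda i: delta(x, u, sample[i][0], sample[i][1]))
            used[l] = True
            cumulated += sample[l][2]      # use its observed reward...
            x = sample[l][3]               # ...and jump to its end state
        returns.append(cumulated)
    return float(np.mean(returns))         # average over the p trajectories
```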

Model-free Monte Carlo estimation: Example with T = 3, p = 2, n = 8
[Figure: step-by-step rebuilding of the two artificial trajectories from the sample of 8 one-step transitions]


Model-free Monte Carlo estimation: Theoretical properties
● Lipschitz continuity assumptions: the functions f, ρ and h are assumed to be Lipschitz continuous, with known constants L_f, L_ρ and L_h, uniformly over the disturbance w.

Model-free Monte Carlo estimation
● Distance metric ∆ on state-action pairs: ∆((x, u), (x', u')) = ‖x − x'‖ + ‖u − u'‖.
● k-sparsity: the k-sparsity of the sample is the supremum, over all state-action pairs (x, u), of the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample (a small sketch follows below).
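A small sketch of the two quantities just defined. The k-sparsity is approximated here by maximizing over a finite set of probe state-action points; that approximation (and all names) is an assumption of the sketch, since the actual definition takes a supremum over the whole continuous space.

```python
import numpy as np
from typing import List, Tuple

def delta(x, u, xl, ul) -> float:
    """Distance metric between two state-action pairs."""
    return float(np.linalg.norm(x - xl) + np.linalg.norm(u - ul))

def knn_distance(x, u, sample: List[tuple], k: int) -> float:
    """Distance of (x, u) to its k-th nearest neighbor in the sample."""
    distances = sorted(delta(x, u, xl, ul) for (xl, ul, _, _) in sample)
    return distances[k - 1]

def k_sparsity(sample: List[tuple], probe_points: List[Tuple], k: int) -> float:
    """Approximate k-sparsity: the largest k-NN distance over a finite set of
    probe state-action points (a stand-in for the supremum over X x U)."""
    return max(knn_distance(x, u, sample, k) for (x, u) in probe_points)
```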

Model-free Monte Carlo estimation
● Expected value of the MFMC estimator
Theorem (Bound on the bias of the MFMC estimator)
The bias of the MFMC estimator with respect to the expected return of h is bounded by a term that grows linearly with the k-sparsity of the sample.

Model-free Monte Carlo estimation
● Variance of the MFMC estimator
Theorem (Bound on the variance of the MFMC estimator)
The variance of the MFMC estimator is bounded by a term that matches the variance of a classical Monte Carlo estimator up to additional terms that vanish with the k-sparsity of the sample.

Model-free Monte Carlo estimation: Illustration
● p_W(·) is uniform over W, T = 15, x_0 = −0.5

Model-free Monte Carlo estimation
● Simulations for p = 10, n = 100 ... 10 000, uniform grid
[Figure: Model-free Monte Carlo estimator vs. classical Monte Carlo estimator, for n = 100 ... 10 000 and p = 10]

Model-free Monte Carlo estimation
● Simulations for p = 1 ... 100, n = 10 000, uniform grid
[Figure: Model-free Monte Carlo estimator (p = 1 ... 100, n = 10 000) vs. classical Monte Carlo estimator (p = 1 ... 100)]

Fifth contribution: Variable selection

Variable selection: Problem
● Given a sample of system transitions F_n, how can one compute a control policy based on small subsets of state variables?

Variable selection
● The batch mode RL algorithm Fitted Q Iteration (FQI) computes successive approximate state-action value functions Q_N (from N = 1 to N = T) from the sample of system transitions.
● This algorithm is particularly efficient when the Q_N functions are approximated using ensembles of regression trees.
● These ensembles of trees can be used to compute the relevance of variables using a variance reduction criterion. A sketch of FQI with tree ensembles is given below.
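A hedged sketch of Fitted Q Iteration with tree ensembles, here using scikit-learn's ExtraTreesRegressor as a stand-in for the tree-based regressor; the finite-horizon, undiscounted form follows the T-stage setting of the slides, and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(sample, actions, T: int, n_estimators: int = 50):
    """Finite-horizon FQI on one-step transitions (x, u, r, y): returns the
    fitted tree-ensemble approximations [Q_1, ..., Q_T]."""
    X = np.array([np.concatenate([x, np.atleast_1d(u)]) for (x, u, _, _) in sample])
    r = np.array([rl for (_, _, rl, _) in sample])
    q_functions, targets = [], r.copy()        # Q_1 regression targets = rewards
    for n in range(T):
        q = ExtraTreesRegressor(n_estimators=n_estimators).fit(X, targets)
        q_functions.append(q)
        if n + 1 < T:
            # Q_{N+1} targets: r + max over actions of Q_N at the successor state.
            next_q = np.stack([
                q.predict(np.array([np.concatenate([y, np.atleast_1d(a)])
                                    for (_, _, _, y) in sample]))
                for a in actions])
            targets = r + next_q.max(axis=0)
    return q_functions
```

A greedy action at time t can then be read from Q_{T−t} by maximizing its prediction over the finite action set.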

Variable selection: Strategy
● Compute the Q_N functions (from N = 1 to N = T) by running the FQI algorithm on the original sample of transitions.
● Compute the relevance of the different attributes x(i) using a score that aggregates, over all tree nodes ν where x(i) is used to split, the empirical variance reduction v(·) achieved by the split.
● Rerun the FQI algorithm on a modified sample of transitions "projected" onto the best attributes. A sketch of this strategy is given below.
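A hedged sketch of this strategy. The variable relevance is read from the fitted tree ensembles via scikit-learn's impurity-based feature_importances_, which implements a variance-reduction criterion for regression trees and is used here as a stand-in for the score on the slide; fitted_q_iteration refers to the FQI sketch above, and the projection step simply keeps the selected state variables in every transition.

```python
import numpy as np

def rank_state_variables(q_functions, state_dim: int):
    """Aggregate variance-reduction importances of the state variables over
    the fitted Q_N tree ensembles (the trailing action features are ignored)."""
    scores = np.zeros(state_dim)
    for q in q_functions:
        scores += q.feature_importances_[:state_dim]  # state features come first
    return np.argsort(scores)[::-1]                   # most relevant first

def project_sample(sample, kept_variables):
    """Keep only the selected state variables in every (x, u, r, y) transition."""
    keep = list(kept_variables)
    return [(x[keep], u, r, y[keep]) for (x, u, r, y) in sample]

# Usage sketch: rerun FQI on the sample projected onto the m best variables.
# q_functions = fitted_q_iteration(sample, actions, T)
# best = rank_state_variables(q_functions, state_dim)[:m]
# q_reduced = fitted_q_iteration(project_sample(sample, best), actions, T)
```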

Variable selection: Illustration
● Preliminary validation on the car-on-the-hill benchmark
● Non-relevant variables (NRV) are added to the state vector

Conclusions & Future works

Conclusions & future works: Summary
● We have proposed a new approach for computing bounds and for addressing the generalization problem in a cautious manner in deterministic settings
● We have proposed strategies for generating informative batch collections of data
● We have introduced a new model-free estimator of policy performance for stochastic environments
● We have proposed a variable selection technique for batch mode RL
● These approaches have been either analytically investigated or empirically validated

Conclusions & future works: Future works
● Extending the approaches to stochastic frameworks, partially observable environments, ...
● Developing risk-sensitive approaches
● Analyzing empirical approaches
● Developing new batch mode RL paradigms based on artificial trajectories
● Validating the algorithms on actual clinical data
