PAC Reinforcement Learning with an Imperfect Model
Nan Jiang, Microsoft Research, NYC

Motivation: sim2real transfer for RL

Sufficient conditions and algorithms

● Empirical success of deep RL (Atari games, MuJoCo, Go, etc.)
● Popular algorithms are sample-intensive for real-world applications

Definition 1: A partially corrected model M̂_X is one whose dynamics are the same as M on X, and the same as M̂ otherwise.
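In symbols (a direct restatement of Definition 1, where P denotes the real transition kernel and P̂ the simulator's; rewards, if they also differ, would be corrected the same way):

```latex
P_{\hat M_X}(\cdot \mid s, a) \;=\;
\begin{cases}
  P(\cdot \mid s, a),      & (s, a) \in X \quad \text{(replaced by the real dynamics)} \\
  \hat P(\cdot \mid s, a), & (s, a) \notin X \quad \text{(kept as the simulator's dynamics)}
\end{cases}
```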

● Sim2real approach: (1) train in a simulator, (2) transfer to real world

Condition 1: V*(s0) is always at least as high in M̂_X as in M, for all X ⊆ Xξ-inc.

● Hope: reduce sample complexity with a high-fidelity simulator.
(Figure: the sim2real loop: compute policy in the simulator, collect data in the real environment, verify, calibrate the simulator. Figures from [1].)
(See the agnostic version of the conditions in the paper.)
Theorem 1: Under Condition 1, there exists an algorithm that achieves O(|Xξ-inc|² H⁴ log(1/δ)/ε³) sample complexity for ξ = O(ε/H²).
Algorithm 1: illustration on the previous example, Model 3.


A simple theoretical question: If the simulator is only wrong in a small number of state-action pairs, can we substantially reduce #real trajectories needed?

● Collect data using the optimal policy in the simulator.
● Blue cells: plug in estimated dynamics at states with enough samples (see the sketch below).
Key steps in analysis:
● Accurate estimation of transitions may require O(|S|) samples per (s, a).
● This would incur dependence on |S|, which we need to avoid.
● Workaround: union bound over V* of all partially corrected models, which only incurs log(2^|Xξ-inc|) = O(|Xξ-inc|).
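A minimal sketch of the correction step above, assuming a tabular simulator stored as arrays; the names, data layout, and threshold m are illustrative, not the paper's exact pseudocode:

```python
import numpy as np

def partially_correct(P_hat, R_hat, counts, next_counts, reward_sums, m):
    """Plug empirical dynamics into the simulator at (s, a) pairs with >= m real samples.

    P_hat:       (S, A, S) simulator transition probabilities
    R_hat:       (S, A)    simulator mean rewards
    counts:      (S, A)    number of real samples at each (s, a)
    next_counts: (S, A, S) empirical next-state counts from real trajectories
    reward_sums: (S, A)    sums of observed real rewards
    m:           sample-size threshold for trusting the empirical estimates
    """
    P, R = P_hat.copy(), R_hat.copy()
    corrected = counts >= m                            # the "blue cells"
    for s, a in zip(*np.nonzero(corrected)):
        P[s, a] = next_counts[s, a] / counts[s, a]     # empirical transition distribution
        R[s, a] = reward_sums[s, a] / counts[s, a]     # empirical mean reward
    return P, R, corrected                             # plan in (P, R) as usual afterwards
```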

Answer: No! Further conditions are needed…
Deeper thoughts: many scenarios in sim2real transfer
● What to transfer: policy, features, skills, etc. (we focus on policy)

What if we cannot change the model?

● How to quantify fidelity

Basic idea:
● Identify the wrong states as necessary.
● Terminate a simulated episode when running into a wrong (s, a), i.e., penalize a wrong (s, a) by fixing Q(s, a) = 0 (Vmin) in planning (see the backup below).
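Concretely, the equivalence in the last bullet can be written as a Bellman backup in the penalized model M̂\X (a restatement under the usual finite-horizon convention, with P̂, R̂ the simulator's dynamics and rewards):

```latex
Q_{\hat M \setminus X,\,h}(s, a) \;=\;
\begin{cases}
  0, & (s, a) \in X \quad \text{(episode terminates; the pair is penalized)} \\[2pt]
  \hat R(s, a) + \mathbb{E}_{s' \sim \hat P(\cdot \mid s, a)}
    \big[\max_{a'} Q_{\hat M \setminus X,\,h+1}(s', a')\big], & (s, a) \notin X,
\end{cases}
\qquad Q_{\hat M \setminus X,\,H+1} \equiv 0.
```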

○ Prior theories (e.g., [2]) focus on global error (worst-case over all states)
○ Local errors (#states with large errors)?
● Is an interactive protocol really better than a non-interactive one?
Answer: Yes! (non-interactive: collect real data, calibrate the model, done)

Definition 2: A partially penalized model M̂\X is one that terminates on X, and has the same dynamics as M̂ otherwise.
Condition 2: V*(s0) is always at least as high in M̂\X as in M, for all X ⊆ Xξ-inc.

Setup

Theorem 2: Under Condition 2, there exists an algorithm that achieves O(|Xξ-inc|² H² log(1/δ)/ε³) sample complexity for ξ = O(ε/H).

● Real environment: episodic MDP M = (S, A, P, R, H, s0).

Algorithm 2: M0 ← M̂, X0 ← {}. For t = 0, 1, 2, …
● Let πt be the optimal policy of Mt. Monte-Carlo evaluate πt.

● Simulator: M̂ = (S, A, P̂, R̂, H, s0).
● Define Xξ-inc as the set of “wrong” (s, a) pairs, i.e., those where the simulator deviates from the real environment by more than ξ.
● Goal: learn a policy π such that V*(s0) − Vπ(s0) ≤ ε, using only poly(|Xξ-inc|, H, 1/ε, 1/δ) real trajectories.

● Return πt if Vπt(s0) in M is close to V*(s0) in Mt.
● Sample real trajectories using πt.
● Once #samples from some (s, a) reaches a threshold, estimate its deviation from the simulator.

● No dependence on |S| or |A|; instead, the sample complexity adapts to the simulator’s quality.
● This is impossible without further assumptions…

● If the deviation is large, Xt+1 ← Xt ∪ {(s, a)}, Mt+1 ← M̂\Xt+1 (see the sketch below).
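A minimal, self-contained sketch of the Algorithm 2 loop on a toy tabular problem. The environment, thresholds, and helper names are illustrative assumptions (and the random toy MDP need not satisfy Condition 2); the snippet only shows the mechanics of plan, evaluate, penalize:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, s0 = 6, 3, 5, 0

# Real environment M: a random tabular MDP. Simulator M̂: identical except on a
# few "wrong" (s, a) pairs, so Xξ-inc is small.
P = rng.dirichlet(np.ones(S), size=(S, A))          # real transitions, shape (S, A, S)
R = rng.uniform(size=(S, A))                        # real mean rewards
P_hat, R_hat = P.copy(), R.copy()
for s, a in [(0, 0), (3, 1)]:                       # wrong pairs (unknown to the learner)
    P_hat[s, a] = rng.dirichlet(np.ones(S))

def plan(P_model, R_model, X):
    """Finite-horizon value iteration in the penalized model: Q(s, a) = 0 on X."""
    Q, V = np.zeros((H, S, A)), np.zeros(S)
    for h in range(H - 1, -1, -1):
        Q[h] = R_model + P_model @ V                # Bellman backup in the simulator
        Q[h][X] = 0.0                               # terminate on penalized (s, a)
        V = Q[h].max(axis=1)
    return Q.argmax(axis=2), V                      # greedy policy (H, S), values V_0 (S,)

def rollout(policy):
    """One real episode following policy; returns its return and visited transitions."""
    s, ret, traj = s0, 0.0, []
    for h in range(H):
        a = policy[h, s]
        s_next = rng.choice(S, p=P[s, a])
        ret += R[s, a]                              # toy setting: observe the mean reward
        traj.append((s, a, s_next))
        s = s_next
    return ret, traj

X = np.zeros((S, A), dtype=bool)                    # X_t: pairs penalized so far
counts, next_counts = np.zeros((S, A)), np.zeros((S, A, S))
m, tol = 500, 0.15                                  # sample threshold and tolerance (illustrative)

for t in range(20):                                 # outer loop of Algorithm 2
    pi_t, V0 = plan(P_hat, R_hat, X)                # π_t: optimal policy of M_t = M̂\X_t
    results = [rollout(pi_t) for _ in range(m)]     # Monte-Carlo evaluate π_t in the real M
    if np.mean([ret for ret, _ in results]) >= V0[s0] - tol:
        break                                       # Vπt(s0) in M close to V*(s0) in M_t: return π_t
    for _, traj in results:                         # otherwise keep counting real transitions
        for s, a, s_next in traj:
            counts[s, a] += 1
            next_counts[s, a, s_next] += 1
    for s, a in zip(*np.nonzero((counts >= m) & ~X)):
        dev = 0.5 * np.abs(next_counts[s, a] / counts[s, a] - P_hat[s, a]).sum()
        if dev > tol:                               # empirical dynamics far from the simulator's
            X[s, a] = True                          # X_{t+1} = X_t ∪ {(s, a)}, M_{t+1} = M̂\X_{t+1}

print("penalized pairs:", list(zip(*np.nonzero(X))))
```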

Lower bound and hard instances

Non-interactive protocol is inefficient

● Lower bound: Ω(|S×A|/ε²), even when |X0-inc| = constant!

Theorem 3: “Collect data, calibrate, done” style algorithms cannot have poly(|Xξ-inc|, H, 1/ε, 1/δ) sample complexity, even with Conditions 1 & 2.


● Proof sketch:
○ Bandit hard instance: M = all arms Ber(½), except one arm a* with Ber(½ + ε)

Proof sketch: assume such an algorithm exists. Then,

○ Approximate model: M̂ = all arms Ber(½); |X0-inc| = 1, but useless

● The same dataset can calibrate multiple models.

● Illustration:

● Consider the bandit hard instance. Design |A|² models: ∀ a, a' ∈ A, M̂a,a' = all arms Ber(½), except a & a' with Ber(½ + ε).
● When a = a*, both Conditions 1 & 2 are met and |X0-inc| = 1.

(Figure: the real environment alongside Model 1 (hard instance), Model 2 (hard instance), and Model 3 (good case).)

○ Issue with Model 1: too pessimistic
○ Issue with Model 2: initially optimistic; pessimistic once the error is fixed
○ Good property of Model 3: always optimistic

● The hypothetical algorithm prefers a* to a' with ⅔ prob. using a dataset of constant size.
● Majority vote over O(log|A|) datasets boosts the success prob. to 1 − O(1/|A|) (see the bound below).
● This would solve the bandit hard instance with polylog(|A|) samples, contradicting the Ω(|A|) lower bound.
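One standard way to make the boosting step precise is Hoeffding's inequality (a sketch of the calculation; the paper's exact constants may differ):

```latex
% Each of k independent datasets yields the correct comparison (a* preferred to a')
% with probability at least 2/3, so the expected error fraction is at most 1/3.
% The majority vote errs only if at least half of the k comparisons err:
\Pr[\text{majority errs}]
  \;\le\; \Pr\!\Big[\tfrac{1}{k}\textstyle\sum_{i=1}^{k}\mathbf{1}\{\text{run } i \text{ errs}\} \ge \tfrac12\Big]
  \;\le\; \exp\!\big(-2k(\tfrac12 - \tfrac13)^2\big)
  \;=\; \exp(-k/18),
% so k = O(\log|A|) datasets suffice for success probability 1 - O(1/|A|).
```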

[1] Rusu et al. Sim-to-real robot learning from pixels with progressive nets. CoRL 2017.
[2] Cutler et al. Real-world reinforcement learning via multifidelity simulators. IEEE Transactions on Robotics, 2015.
