Online Supplement: Improving UCT Planning via Approximate Homomorphisms

Nan Jiang¹, Satinder Singh¹, and Richard Lewis²

¹ Computer Science and Engineering, University of Michigan
² Department of Psychology, University of Michigan

Theoretical Analysis

The main idea of our algorithm is to build empirical local layered MDPs from the trajectories sampled by UCT and then to find approximate homomorphisms in them. This construction is lossy in two ways: 1. the empirical MDP differs from the original MDP; 2. the abstraction is built from the empirical MDP and hence has approximation errors with respect to it. We will show how to combine the loss from both sources. First, we define the notion of loss in value, along with some notation useful for the analysis.

Notation & Objective of Analysis: For the current state of interest, let the true local layered MDP with depth d_max be M. In the first step that introduces value loss, UCT samples n trajectories in M and builds an empirical MDP M̂. In a second step that also introduces value loss, an approximate homomorphism h maps M̂ to an abstract MDP M̂_h, constructed by applying Algorithm 1 in the main paper to M̂ with parameter (ε⁰_T/2, ε⁰_R/2) (hence the approximation error of the constructed abstraction is at most (ε⁰_T, ε⁰_R)).¹ Our analytical objective is to bound the loss of the abstraction, i.e., to bound ‖V_M^{π*} − V_M^{π*_h}‖_∞, where V_M^π is the expected value function of policy π evaluated in MDP M, π* is the optimal policy in M, and π*_h is the optimal policy in M̂_h lifted to the true local MDP M.

Theorem 1. (Main Result) ∀η_T, η_R > 0, 0 < δ < 1,

    ‖V_M^{π*} − V_M^{π*_h}‖_∞ ≤ 2(ε⁰_R + η_R)/(1 − γ) + γ(R_max − R_min)(ε⁰_T + η_T)/(1 − γ)²    (1)

holds with probability at least 1 − δ if

    n > max{ exp^(d_max)( log(a(KB)^{d_max}/δ) / b ), N },

where exp^(d_max)(·) denotes the d_max-fold iterated exponential (the inverse of the iterated logarithm log^(d_max)(·) appearing below), and

1. K is the number of actions available in each state,
2. B is the maximal number of possible next states from a state-action pair,
3. p := min_{(s,a,s₁,d): P(s,a,s₁,d) > 0} P(s,a,s₁,d) / 2,
4. [R_min, R_max] is the range of rewards in M, M̂ and M̂_h,
5. N, c are positive constants that do not depend on the choice of η_T, η_R,
6. a := max{3d_max, 6B},
7. b := min{2cp², 2cη_R²/(R_max − R_min)², 2cη_T²/B²}.

¹ We use P, P̂ and P̂_h to distinguish the transition probabilities of M, M̂ and M̂_h, and similarly for the reward functions.
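To make the shape of the guarantee concrete, the right-hand side of Eq.(1) can be evaluated numerically. The sketch below is purely illustrative: all parameter values are assumptions chosen for demonstration, not values from the paper.

```python
def value_loss_bound(eps_R0, eps_T0, eta_R, eta_T, gamma, r_max, r_min):
    """Right-hand side of Eq.(1): the value-loss bound of Theorem 1."""
    reward_term = 2 * (eps_R0 + eta_R) / (1 - gamma)
    transition_term = gamma * (r_max - r_min) * (eps_T0 + eta_T) / (1 - gamma) ** 2
    return reward_term + transition_term

# Hypothetical example: small abstraction and estimation errors, gamma = 0.9,
# rewards in [0, 1]; the transition term dominates because of (1 - gamma)^2.
bound = value_loss_bound(eps_R0=0.05, eps_T0=0.05, eta_R=0.05, eta_T=0.05,
                         gamma=0.9, r_max=1.0, r_min=0.0)
print(bound)   # 2*0.1/0.1 + 0.9*0.1/0.01 = 2.0 + 9.0, i.e. about 11.0
```

As γ → 1, the transition term grows quadratically faster than the reward term, which is why η_T and ε⁰_T are the quantities to control in long-horizon settings.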

We prove Theorem 1 using the following three lemmas.

Lemma 2. The probability that

    max_{s,a,d} Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| ≤ η_T  and  max_{s,a,d} |R̂(s,a,d) − R(s,a,d)| ≤ η_R    (2)

holds is at least

    1 − (KB)^{d_max} ( d_max exp(−2p²c log^(d_max)(n)) + 2 exp(−2c log^(d_max)(n) η_R²/(R_max − R_min)²)
        + 2B exp(−2η_T² c log^(d_max)(n)/B²) ).    (3)
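To see how the success probability guaranteed by Eq.(3) behaves as a function of n, the expression can be evaluated directly. The constants below are hypothetical; d_max = 1 is used so that the iterated logarithm reduces to a single log, and even then the bound is vacuous (negative) until n is extremely large, reflecting the asymptotic nature of the guarantee.

```python
import math

def iter_log(x, k):
    """k-fold iterated logarithm log^(k)(x)."""
    for _ in range(k):
        x = math.log(x)
    return x

def success_prob_lower_bound(n, d_max, K, B, c, p, eta_T, eta_R, r_range):
    """Evaluate the lower bound (3) on the probability that event (2) holds."""
    L = c * iter_log(n, d_max)
    inner = (d_max * math.exp(-2 * p**2 * L)
             + 2 * math.exp(-2 * L * eta_R**2 / r_range**2)
             + 2 * B * math.exp(-2 * eta_T**2 * L / B**2))
    return 1 - (K * B) ** d_max * inner

# Hypothetical constants, not values from the paper.
args = dict(d_max=1, K=4, B=3, c=1.0, p=0.25, eta_T=0.3, eta_R=0.3, r_range=1.0)
lo = success_prob_lower_bound(1e3, **args)
hi = success_prob_lower_bound(1e300, **args)
print(lo, hi)   # lo is vacuous (negative); hi is close to 1
```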

Lemma 3. Let the approximation parameter of h : M̂ ↦ M̂_h be (ε_T, ε_R). If Eq.(2) holds, then ε_T ≤ ε⁰_T + η_T and ε_R ≤ ε⁰_R + η_R.

Lemma 4. (Ravindran and Barto, 2004 [1])

    ‖V_M^{π*} − V_M^{π*_h}‖_∞ ≤ 2ε_R/(1 − γ) + γ(R_max − R_min)ε_T/(1 − γ)².

The key idea in the proof of Lemma 2 is to consider each (s,a,d) and bound the probability that R̂(s,a,d) and P̂(s,a,·,d) are not accurate, and then to bound the probability that inaccurate estimates occur at some (s,a,d) by the union bound. To obtain the former result, we first need to bound the number of times an (s,a,d) tuple is visited, which is given in the following lemma.

Lemma 5. ∃c > 0, N > 0 s.t. when n > N,

    P{ n_{s,a,d} ≥ c log^(d+1)(n) } ≥ (1 − exp(−2p²c log^(d_max)(n)))^d.

Proof. (By induction.) At d = 0, which is the root, the state is visited exactly n times. According to Theorem 3 in [2], ∃ρ > 0 s.t. n_{s,a,d} ≥ ρ log(n_{s,d}). Therefore, n_{s,a,0} ≥ ρ log(n_{s,0}) > c log(n) with probability 1 as long as c < ρ. Now consider an arbitrary d < d_max. Let the state-action pair at the previous level that leads to s be (s′, a′, d−1). According to the induction assumption,

    P{ n_{s′,a′,d−1} ≥ c log^(d)(n) } ≥ (1 − exp(−2p²c log^(d_max)(n)))^{d−1}.    (4)

What we need to bound is P{ n_{s,a,d} ≥ c log^(d+1)(n) }, which can be decomposed in the following way:

    P{ n_{s,a,d} ≥ c log^(d+1)(n) }
    ≥ P{ n_{s,a,d} ≥ c log^(d+1)(n), n_{s′,a′,d−1} ≥ c log^(d)(n) }
    = P{ n_{s,a,d} ≥ c log^(d+1)(n) | n_{s′,a′,d−1} ≥ c log^(d)(n) } · P{ n_{s′,a′,d−1} ≥ c log^(d)(n) }.

The second term has already been bounded in Eq.(4). We bound the first term in two steps. First, we show that

    P{ n_{s,a,d} ≥ c log^(d+1)(n) | n_{s′,a′,d−1} ≥ c log^(d)(n) }
    ≥ P{ n_{s,d}/n_{s′,a′,d−1} ≥ p | n_{s′,a′,d−1} ≥ c log^(d)(n) }.    (5)

This is because when n_{s,d}/n_{s′,a′,d−1} ≥ p holds,

    n_{s,a,d} ≥ ρ log(n_{s,d}) ≥ ρ log(pc log^(d)(n)) = ρ log^(d+1)(n) + ρ log(pc).

Note that the second term is a constant; thus for any 0 < c < ρ, as long as log^(d_max)(n) > log(pc)/(c − ρ) (solving this inequality yields N, which does not depend on η_T and η_R), we have n_{s,a,d} ≥ c log^(d+1)(n). This shows that the event on the right side of Eq.(5) is a sub-event of that on the left side, so the inequality holds.

Second, we bound the right side of Eq.(5). For any fixed n_{s′,a′,d−1}, the ratio n_{s′,a′,s,d−1}/n_{s′,a′,d−1} is the average of Bernoulli random variables with expected value P(s′,a′,s,d−1). By the Hoeffding bound,

    P{ n_{s,d}/n_{s′,a′,d−1} ≥ p } ≥ P{ n_{s′,a′,s,d−1}/n_{s′,a′,d−1} ≥ p }
    = P{ n_{s′,a′,s,d−1}/n_{s′,a′,d−1} − P(s′,a′,s,d−1) ≥ p − P(s′,a′,s,d−1) }
    ≥ 1 − exp(−2(P(s′,a′,s,d−1) − p)² n_{s′,a′,d−1}).

According to the definition of p in Theorem 1, we always have P(s′,a′,s,d−1) − p ≥ p. With n_{s′,a′,d−1} ≥ c log^(d)(n), we have

    P{ n_{s,d}/n_{s′,a′,d−1} ≥ p } ≥ 1 − exp(−2p²c log^(d)(n)).

Hence

    P{ n_{s,a,d} ≥ c log^(d+1)(n) }
    ≥ P{ n_{s,a,d} ≥ c log^(d+1)(n) | n_{s′,a′,d−1} ≥ c log^(d)(n) } · P{ n_{s′,a′,d−1} ≥ c log^(d)(n) }
    ≥ P{ n_{s,d}/n_{s′,a′,d−1} ≥ p | n_{s′,a′,d−1} ≥ c log^(d)(n) } · P{ n_{s′,a′,d−1} ≥ c log^(d)(n) }
    ≥ (1 − exp(−2p²c log^(d)(n)))(1 − exp(−2p²c log^(d_max)(n)))^{d−1}
    ≥ (1 − exp(−2p²c log^(d_max)(n)))^d.

So the lemma follows.

Proof of Lemma 2. Consider a state-action-depth tuple (s,a,d). For the empirical reward and transition probabilities to be accurate at (s,a,d), we first require that (s,a,d) is visited sufficiently often. By relaxing Lemma 5 we obtain a bound on n_{s,a,d} that is uniform in d: ∃c, N′ s.t. ∀(s,a,d), ∀n > N′,

    P{ n_{s,a,d} ≥ c log^(d_max)(n) } ≥ (1 − exp(−2p²c log^(d_max)(n)))^{d_max},

where c and N′ are the constants specified in Lemma 5. Now we can bound the probabilities that the reward and transition estimates are inaccurate separately. The empirical reward R̂(s,a,d) is the average of at least c log^(d_max)(n) i.i.d. samples of random variables that lie in [R_min, R_max] with expected value R(s,a,d); hence by the Hoeffding bound

    P{ |R̂(s,a,d) − R(s,a,d)| > η_R | n_{s,a,d} ≥ c log^(d_max)(n) } ≤ 2 exp(−2c log^(d_max)(n) η_R²/(R_max − R_min)²).

Similarly for the transition probabilities,

    P{ Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| > η_T | n_{s,a,d} ≥ c log^(d_max)(n) }
    ≤ P{ ∪_{s₁} { |P̂(s,a,s₁,d) − P(s,a,s₁,d)| > η_T/B } | n_{s,a,d} ≥ c log^(d_max)(n) }
    ≤ Σ_{s₁} P{ |P̂(s,a,s₁,d) − P(s,a,s₁,d)| > η_T/B | n_{s,a,d} ≥ c log^(d_max)(n) }.

Consider a particular possible next state s₁:

    P{ |P̂(s,a,s₁,d) − P(s,a,s₁,d)| > η_T/B | n_{s,a,d} ≥ c log^(d_max)(n) } ≤ 2 exp(−2η_T² c log^(d_max)(n)/B²).

As (s,a) has at most B possible next states,

    P{ Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| > η_T | n_{s,a,d} ≥ c log^(d_max)(n) } ≤ 2B exp(−2η_T² c log^(d_max)(n)/B²).

Observing that the empirical reward and the empirical transition distribution are independent of each other when fixing n_{s,a,d}, we have the following result: ∀(s,a,d),

    P{ |R̂(s,a,d) − R(s,a,d)| ≤ η_R, Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| ≤ η_T }
    ≥ P{ |R̂(s,a,d) − R(s,a,d)| ≤ η_R, Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| ≤ η_T, n_{s,a,d} ≥ c log^(d_max)(n) }
    = P{ n_{s,a,d} ≥ c log^(d_max)(n) } · P{ |R̂(s,a,d) − R(s,a,d)| ≤ η_R | n_{s,a,d} ≥ c log^(d_max)(n) }
      · P{ Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| ≤ η_T | n_{s,a,d} ≥ c log^(d_max)(n) }
    ≥ (1 − exp(−2p²c log^(d_max)(n)))^{d_max} · (1 − 2 exp(−2c log^(d_max)(n) η_R²/(R_max − R_min)²))
      · (1 − 2B exp(−2η_T² c log^(d_max)(n)/B²)).    (6)
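The Hoeffding-style bound on the transition estimate used above is easy to check empirically. A minimal Monte-Carlo sketch follows, with an assumed B = 3 next-state distribution standing in for P(s,a,·,d); the specific probabilities and sample sizes are hypothetical.

```python
import math
import random

# Monte-Carlo check: with m i.i.d. draws from a categorical next-state
# distribution (B = 3, assumed values), the L1 error of the empirical
# estimate exceeds eta_T with probability at most 2*B*exp(-2*eta_T**2*m/B**2).
random.seed(0)
probs = [0.5, 0.3, 0.2]      # hypothetical P(s, a, ., d)
B = len(probs)
m = 500                      # stands in for n_{s,a,d}
eta_T = 0.3
trials = 2000

failures = 0
for _ in range(trials):
    sample = random.choices(range(B), weights=probs, k=m)
    counts = [sample.count(j) for j in range(B)]
    l1_error = sum(abs(cnt / m - q) for cnt, q in zip(counts, probs))
    if l1_error > eta_T:
        failures += 1

empirical = failures / trials
theoretical = 2 * B * math.exp(-2 * eta_T**2 * m / B**2)
print(empirical, theoretical)   # observed failure rate vs. the bound
```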

The final step is to bound the probability that the estimates are accurate everywhere. By the union bound,

    P{ M̂ is not (η_T, η_R)-accurate }
    = P{ ∪_{(s,a,d)} { M̂ is not (η_T, η_R)-accurate at (s,a,d) } }
    ≤ Σ_{(s,a,d)} P{ M̂ is not (η_T, η_R)-accurate at (s,a,d) }
    ≤ #(s,a,d) · ( 1 − P{ |R̂(s,a,d) − R(s,a,d)| ≤ η_R, Σ_{s₁} |P̂(s,a,s₁,d) − P(s,a,s₁,d)| ≤ η_T } ).

The bound in Lemma 2 is obtained by plugging in Eq.(6) and noticing that #(s,a,d) ≤ (KB)^{d_max}; it can be simplified by its first-order approximation (which is strictly smaller). Finally, to go from Lemma 2 to our main result, we only have to require that each term in the outermost parentheses in Eq.(3) be less than δ/(3(KB)^{d_max}) and find an n that satisfies this requirement.
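Translating the per-term requirement δ/(3(KB)^(d_max)) into a sample size inverts the iterated logarithm, which is where the constants a and b of Theorem 1 come from. A small sketch with illustrative constants only (c, p, and the η's are not values from the paper):

```python
import math

def iter_log(x, k):
    """k-fold iterated logarithm log^(k)(x)."""
    for _ in range(k):
        x = math.log(x)
    return x

def iter_exp(x, k):
    """k-fold iterated exponential, the inverse of iter_log(., k)."""
    for _ in range(k):
        x = math.exp(x)
    return x

# Hypothetical constants, chosen only to exercise the formulas in Theorem 1.
d_max, K, B, delta = 2, 4, 3, 0.05
c, p, eta_T, eta_R, r_range = 1.0, 0.25, 0.1, 0.1, 1.0

a = max(3 * d_max, 6 * B)                      # item 6 of Theorem 1
b = min(2 * c * p**2,
        2 * c * eta_R**2 / r_range**2,
        2 * c * eta_T**2 / B**2)               # item 7 of Theorem 1

# Theorem 1 requires log^(d_max)(n) > log(a * (K*B)**d_max / delta) / b,
# i.e. n > exp^(d_max) of this quantity -- astronomically large here, which
# reflects the asymptotic nature of the guarantee.
required = math.log(a * (K * B) ** d_max / delta) / b
print(a, round(b, 6), round(required, 1))
```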


Proof of Lemma 3.

    ε_T := max_{s,a,d} Σ_x | P̂_h(h(s),a,x,d) − Σ_{s₁:h(s₁)=x} P(s,a,s₁,d) |
         ≤ max_{s,a,d} Σ_x | P̂_h(h(s),a,x,d) − Σ_{s₁:h(s₁)=x} P̂(s,a,s₁,d) |
           + max_{s,a,d} Σ_x | Σ_{s₁:h(s₁)=x} ( P̂(s,a,s₁,d) − P(s,a,s₁,d) ) |
         ≤ ε⁰_T + max_{s,a,d} Σ_x Σ_{s₁:h(s₁)=x} | P̂(s,a,s₁,d) − P(s,a,s₁,d) |
         = ε⁰_T + max_{s,a,d} Σ_{s₁} | P̂(s,a,s₁,d) − P(s,a,s₁,d) |
         ≤ ε⁰_T + η_T.

The first inequality is the triangle inequality; the first term in it is at most ε⁰_T by the construction of M̂_h, and the last inequality follows from Eq.(2). The proof of ε_R ≤ ε⁰_R + η_R is very similar and hence omitted.
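The inequality chain above can be sanity-checked on a toy aggregation. In the sketch below, the probabilities and the abstraction h are hypothetical, and P̂_h is taken to be the exact aggregation of P̂ (so ε⁰_T = 0 in this instance):

```python
# Toy numerical check of Lemma 3's chain for a single (s, a, d):
# four next states s1 in {0,1,2,3}, abstraction h: {0,1} -> 'A', {2,3} -> 'B'.
# All probability values are hypothetical.

P     = {0: 0.40, 1: 0.20, 2: 0.25, 3: 0.15}   # true transition probabilities
P_hat = {0: 0.38, 1: 0.24, 2: 0.22, 3: 0.16}   # empirical estimates
h = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}

# Abstract model built by exact aggregation of P_hat, so eps0_T = 0 here.
P_hat_h = {'A': P_hat[0] + P_hat[1], 'B': P_hat[2] + P_hat[3]}
eps0_T = 0.0

eta_T = sum(abs(P_hat[s1] - P[s1]) for s1 in P)             # L1 estimation error
eps_T = sum(abs(P_hat_h[x] - sum(P[s1] for s1 in P if h[s1] == x))
            for x in ('A', 'B'))                            # abstraction error w.r.t. true P

assert eps_T <= eps0_T + eta_T + 1e-12
print(eps_T, eta_T)   # aggregation can only shrink the L1 error
```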

References

[1] Balaraman Ravindran and Andrew G. Barto. Approximate homomorphisms: A framework for nonexact minimization in Markov decision processes. In Proceedings of the 5th International Conference on Knowledge-Based Computer Systems, 2004.

[2] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, pages 282–293, 2006.
