DESPOT: Online POMDP Planning with Regularization Supplementary Material

Adhiraj Somani Nan Ye David Hsu Wee Sun Lee Department of Computer Science, National University of Singapore [email protected], {yenan,dyhsu,leews}@comp.nus.edu.sg

1 Proof of Theorem 1

We will need two lemmas for proving Theorem 1. The first one is Haussler's bound given in [1, p. 103] (Lemma 9, part (2)).

Lemma 1 (Haussler's bound) Let $Z_1, \ldots, Z_n$ be i.i.d. random variables with range $0 \le Z_i \le M$, $\mathbb{E}(Z_i) = \mu$, and $\hat\mu = \frac{1}{n}\sum_{i=1}^n Z_i$. Assume $\nu > 0$ and $0 < \alpha < 1$. Then
$$\Pr\big(d_\nu(\hat\mu, \mu) > \alpha\big) < 2e^{-\alpha^2 \nu n / M},$$
where $d_\nu(r, s) = \frac{|r-s|}{\nu + r + s}$. As a consequence,
$$\Pr\left(\mu < \frac{1-\alpha}{1+\alpha}\,\hat\mu - \frac{\alpha}{1+\alpha}\,\nu\right) < 2e^{-\alpha^2 \nu n / M}.$$
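As a quick numerical sanity check (not part of the proof), the consequence of Lemma 1 can be exercised by Monte Carlo simulation. The parameter values and the Bernoulli distribution below are illustrative choices, not from the paper; the check only relies on the samples being bounded in $[0, M]$.

```python
import math
import random

# Illustrative parameters: bounded i.i.d. samples Z_i in [0, M].
M, n = 1.0, 50          # range [0, M], sample size n
nu, alpha = 0.5, 0.5    # Haussler parameters: nu > 0, 0 < alpha < 1
p = 0.1                 # Z_i ~ M * Bernoulli(p), so mu = M * p
mu = M * p
bound = 2 * math.exp(-alpha**2 * nu * n / M)

random.seed(0)
trials, hits = 100_000, 0
for _ in range(trials):
    mu_hat = sum(M * (random.random() < p) for _ in range(n)) / n
    # Event from the "consequence" of Lemma 1:
    # mu < (1-alpha)/(1+alpha) * mu_hat - alpha/(1+alpha) * nu
    if mu < (1 - alpha) / (1 + alpha) * mu_hat - alpha / (1 + alpha) * nu:
        hits += 1

print(hits / trials, "<=", bound)  # empirical frequency vs. Haussler bound
```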

Let $\Pi_i$ be the class of policy trees in $\Pi_{b_0,D,K}$ having size $i$. The next lemma bounds the size of $\Pi_i$.

Lemma 2 $|\Pi_i| \le i^{i-2} (|A||Z|)^i$.

Proof. Let $\Pi'_i$ be the class of rooted ordered trees of size $i$. $|\Pi'_i|$ is not more than the number of all trees with $i$ labeled nodes, because the in-order labeling of a tree in $\Pi'_i$ corresponds to a labeled tree. By Cayley's formula [3], the number of trees with $i$ labeled nodes is $i^{i-2}$; thus $|\Pi'_i| \le i^{i-2}$. Recall the definition of a policy derivable from a DESPOT in Section 4 of the main text. A policy tree in $\Pi_i$ is obtained from a tree in $\Pi'_i$ by assigning the default policy to each leaf node, one of the $|A|$ possible action labels to every other node, and one of at most $|Z|$ possible labels to each edge. Therefore $|\Pi_i| \le i^{i-2} \cdot |A|^i \cdot |Z|^{i-1} \le i^{i-2} (|A||Z|)^i$. $\square$

In the following, we often abbreviate $V_\pi(b_0)$ and $\hat V_\pi(b_0)$ as $V_\pi$ and $\hat V_\pi$ respectively, since we only consider the true and empirical values for a fixed but arbitrary $b_0$. Our proof follows a line of reasoning similar to [2].

Theorem 1 For any $\tau, \alpha \in (0,1)$ and any set $\Phi_{b_0}$ of $K$ randomly sampled scenarios for belief $b_0$, every policy tree $\pi \in \Pi_{b_0,D,K}$ satisfies
$$V_\pi(b_0) \ge \frac{1-\alpha}{1+\alpha}\,\hat V_\pi(b_0) - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{\ln(4/\tau) + |\pi| \ln\big(KD|A||Z|\big)}{\alpha K}$$
with probability at least $1 - \tau$, where $\hat V_\pi(b_0)$ denotes the estimated value of $\pi$ under $\Phi_{b_0}$.

Proof. Consider an arbitrary policy tree $\pi \in \Pi_{b_0,D,K}$. For a random scenario $\phi$ for the belief $b_0$, executing the policy $\pi$ under $\phi$ gives a sequence of states and observations distributed according to the distributions $P(s'|s,a)$ and $P(z|s,a)$. Therefore the true value of $\pi$ satisfies $V_\pi = \mathbb{E}(V_{\pi,\phi})$, where the expectation is over the distribution of scenarios. On the other hand, since $\hat V_\pi = \frac{1}{K}\sum_{k=1}^K V_{\pi,\phi_k}$ and the scenarios $\phi_1, \ldots, \phi_K$ are independently sampled, Lemma 1 gives
$$\Pr\left(V_\pi < \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{\alpha}{1+\alpha}\,\epsilon_{|\pi|}\right) < 2e^{-\alpha^2 \epsilon_{|\pi|} K/M} \qquad (1)$$
where $M = R_{\max}/(1-\gamma)$, and $\epsilon_i$ is chosen such that
$$2e^{-\alpha^2 \epsilon_i K/M} = \tau/(2i^2 |\Pi_i|). \qquad (2)$$

By the union bound, we have
$$\Pr\left(\exists \pi \in \Pi_{b_0,D,K},\ V_\pi < \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{\alpha}{1+\alpha}\,\epsilon_{|\pi|}\right) \le \sum_{i=1}^\infty \sum_{\pi \in \Pi_i} \Pr\left(V_\pi < \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{\alpha}{1+\alpha}\,\epsilon_{|\pi|}\right).$$
By the choice of the $\epsilon_i$'s and Inequality (1), the right-hand side of the above inequality is bounded by $\sum_{i=1}^\infty |\Pi_i| \cdot \big[\tau/(2i^2|\Pi_i|)\big] = \pi^2\tau/12 < \tau$, where the well-known identity $\sum_{i=1}^\infty 1/i^2 = \pi^2/6$ is used. Hence,
$$\Pr\left(\exists \pi \in \Pi_{b_0,D,K},\ V_\pi < \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{\alpha}{1+\alpha}\,\epsilon_{|\pi|}\right) < \tau. \qquad (3)$$
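As an aside, the constant in this union-bound step is easy to confirm numerically: the per-size failure budgets $\tau/(2i^2)$ sum to $(\pi^2/12)\tau < \tau$. A minimal check of the partial sums:

```python
import math

# Partial sums of sum_{i>=1} 1/(2 i^2) converge to pi^2/12 ~= 0.8225 < 1,
# so the total failure probability (pi^2/12) * tau stays below tau.
partial = sum(1.0 / (2 * i * i) for i in range(1, 1_000_000))
print(partial, math.pi**2 / 12)
```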

Equivalently, with probability at least $1 - \tau$, every $\pi \in \Pi_{b_0,D,K}$ satisfies
$$V_\pi \ge \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{\alpha}{1+\alpha}\,\epsilon_{|\pi|}. \qquad (4)$$

To complete the proof, we now give an upper bound on $\epsilon_{|\pi|}$. From Equation (2), we can solve for $\epsilon_i$ to get
$$\epsilon_i = \frac{R_{\max}}{\alpha(1-\gamma)} \cdot \frac{\ln(4/\tau) + \ln(i^2 |\Pi_i|)}{\alpha K}.$$
For any $\pi$ in $\Pi_{b_0,D,K}$, its size is at most $KD$, and $i^2 |\Pi_i| \le (i|A||Z|)^i \le (KD|A||Z|)^i$ by Lemma 2. Thus we have
$$\epsilon_{|\pi|} \le \frac{R_{\max}}{\alpha(1-\gamma)} \cdot \frac{\ln(4/\tau) + |\pi|\ln(KD|A||Z|)}{\alpha K}.$$
Combining this with Inequality (4), we get
$$V_\pi \ge \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{\ln(4/\tau) + |\pi|\ln(KD|A||Z|)}{\alpha K}.$$
This completes the proof. $\square$
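For concreteness, the regularized lower bound of Theorem 1 can be evaluated directly. The function below is an illustrative sketch; the problem sizes (`K=500`, `D=90`, $|A|=3$, $|Z|=2$, etc.) are hypothetical values chosen for the example, not figures from the paper.

```python
import math

def theorem1_lower_bound(V_hat, pi_size, K, D, n_actions, n_obs,
                         R_max, gamma, tau, alpha):
    """Right-hand side of Theorem 1: a high-probability lower bound on the
    true value V_pi(b0), given the empirical value V_hat under K scenarios."""
    reg = (R_max / ((1 + alpha) * (1 - gamma))
           * (math.log(4 / tau) + pi_size * math.log(K * D * n_actions * n_obs))
           / (alpha * K))
    return (1 - alpha) / (1 + alpha) * V_hat - reg

# Hypothetical problem sizes: a policy tree of size 20, 500 scenarios, depth 90.
lb = theorem1_lower_bound(V_hat=10.0, pi_size=20, K=500, D=90,
                          n_actions=3, n_obs=2, R_max=1.0, gamma=0.95,
                          tau=0.1, alpha=0.5)
print(lb)
```

Note how the penalty grows linearly in the policy size $|\pi|$ and shrinks as $1/K$: with these values the regularization term dominates the shrunken empirical value, illustrating why the bound favors small policy trees when $K$ is moderate.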

2 Proof of Theorem 2

We need the following lemma for proving Theorem 2.

Lemma 3 For a fixed policy $\pi$ and any $\tau \in (0,1)$, with probability at least $1 - \tau$,
$$\hat V_\pi \ge V_\pi - \frac{R_{\max}}{1-\gamma}\sqrt{\frac{2\ln(1/\tau)}{K}}.$$

Proof. Let $\pi$ be a policy with true and empirical values $V_\pi$ and $\hat V_\pi$ as above, and let $M = R_{\max}/(1-\gamma)$. Hoeffding's inequality [4] gives us
$$\Pr\left(\hat V_\pi \ge V_\pi - \epsilon\right) \ge 1 - e^{-K\epsilon^2/(2M^2)}.$$
Setting $\tau = e^{-K\epsilon^2/(2M^2)}$ and solving for $\epsilon$, we get
$$\Pr\left(\hat V_\pi \ge V_\pi - \frac{R_{\max}}{1-\gamma}\sqrt{\frac{2\ln(1/\tau)}{K}}\right) \ge 1 - \tau. \quad\square$$
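Lemma 3 can likewise be sanity-checked by simulation. The per-scenario returns below are synthetic stand-ins (uniform on $[0, M]$); the check only relies on the returns being bounded, which is all Hoeffding's inequality requires.

```python
import math
import random

# Illustrative setup: per-scenario returns bounded in [0, M], M = R_max/(1-gamma).
M, K, tau = 1.0, 100, 0.05
eps = M * math.sqrt(2 * math.log(1 / tau) / K)  # the deviation in Lemma 3

random.seed(0)
mu = 0.5                       # true value of the fixed policy (uniform returns)
trials = 20_000
fails = sum(
    sum(random.random() for _ in range(K)) / K < mu - eps   # V_hat < V - eps
    for _ in range(trials)
)
print(fails / trials, "<=", tau)  # failure rate should not exceed tau
```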

Theorem 2 Let $\pi^*$ be an optimal policy at a belief $b_0$. Let $\pi$ be a policy derived from a DESPOT that has height $D$ and is constructed from $K$ randomly sampled scenarios for belief $b_0$. For any $\tau, \alpha \in (0,1)$, if $\pi$ maximizes
$$\frac{1-\alpha}{1+\alpha}\,\hat V_\pi(b_0) - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{|\pi|\ln(KD|A||Z|)}{\alpha K} \qquad (5)$$
among all policies derived from the DESPOT, then
$$V_\pi(b_0) \ge \frac{1-\alpha}{1+\alpha}\,V_{\pi^*}(b_0) - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \left( \frac{\ln(8/\tau) + |\pi^*|\ln\big(KD|A||Z|\big)}{\alpha K} + (1-\alpha)\sqrt{\frac{2\ln(2/\tau)}{K}} + \gamma^D \right). \qquad (6)$$
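Before turning to the proof, the subtraction term on the right-hand side of (6) can be evaluated numerically to see how its three parts compare. The function and parameter values below are an illustrative sketch with hypothetical problem sizes, not figures from the paper.

```python
import math

def theorem2_gap(opt_size, K, D, n_actions, n_obs, R_max, gamma, tau, alpha):
    """The subtraction term in Inequality (6): the worst-case loss of the
    regularized DESPOT policy relative to (1-alpha)/(1+alpha) * V_{pi*}(b0).
    Its three parts: estimation/regularization cost of pi*, the Hoeffding
    deviation of Lemma 3, and the depth-D truncation error gamma^D."""
    c = R_max / ((1 + alpha) * (1 - gamma))
    return c * ((math.log(8 / tau) + opt_size * math.log(K * D * n_actions * n_obs))
                / (alpha * K)
                + (1 - alpha) * math.sqrt(2 * math.log(2 / tau) / K)
                + gamma ** D)

# Hypothetical values: |pi*| = 20, K = 500 scenarios, depth D = 90.
gap = theorem2_gap(opt_size=20, K=500, D=90, n_actions=3, n_obs=2,
                   R_max=1.0, gamma=0.95, tau=0.1, alpha=0.5)
print(gap)
```

With these values the first term (the size-dependent penalty for $\pi^*$) dominates, while $\gamma^D$ is negligible, which matches the intuition that the bound is driven by the complexity of the optimal policy rather than by depth truncation once $D$ is large.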

Proof. By Theorem 1, with probability at least $1 - \tau/2$,
$$V_\pi \ge \frac{1-\alpha}{1+\alpha}\,\hat V_\pi - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{\ln(8/\tau) + |\pi|\ln(KD|A||Z|)}{\alpha K}.$$
Suppose the above inequality holds on a random set of $K$ scenarios. Note that there is a $\pi' \in \Pi_{b_0,D,K}$ which is a subtree of $\pi^*$ and has the same trajectories on these scenarios up to depth $D$. By the choice of $\pi$ in (5), it follows that with probability at least $1 - \tau/2$,
$$V_\pi \ge \frac{1-\alpha}{1+\alpha}\,\hat V_{\pi'} - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{\ln(8/\tau) + |\pi'|\ln(KD|A||Z|)}{\alpha K}.$$
Note that $|\pi^*| \ge |\pi'|$, and $\hat V_{\pi'} \ge \hat V_{\pi^*} - \gamma^D R_{\max}/(1-\gamma)$ since $\pi'$ and $\pi^*$ differ only from depth $D$ onwards under the chosen scenarios. It follows that with probability at least $1 - \tau/2$,
$$V_\pi \ge \frac{1-\alpha}{1+\alpha}\left(\hat V_{\pi^*} - \gamma^D\,\frac{R_{\max}}{1-\gamma}\right) - \frac{R_{\max}}{(1+\alpha)(1-\gamma)} \cdot \frac{\ln(8/\tau) + |\pi^*|\ln(KD|A||Z|)}{\alpha K}. \qquad (7)$$
By Lemma 3, with probability at least $1 - \tau/2$, we have
$$\hat V_{\pi^*} \ge V_{\pi^*} - \frac{R_{\max}}{1-\gamma}\sqrt{\frac{2\ln(2/\tau)}{K}}. \qquad (8)$$
By the union bound, with probability at least $1 - \tau$, both Inequality (7) and Inequality (8) hold, which together imply Inequality (6). This completes the proof. $\square$

References

[1] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[2] Yi Wang, Kok Sung Won, David Hsu, and Wee Sun Lee. Monte Carlo Bayesian reinforcement learning. arXiv preprint arXiv:1206.6449, 2012.

[3] Arthur Cayley. A theorem on trees. Quarterly Journal of Mathematics, 23:376–378, 1889.

[4] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
