A Game-Theoretic Approach to Apprenticeship Learning — Supplement

Robert E. Schapire Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233 [email protected]

Umar Syed Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233 [email protected]

1 The MWAL Algorithm

For reference, the MWAL algorithm from the main paper is repeated below.

Algorithm 1 The MWAL algorithm

1: Given: An MDP\R M and an estimate µ̂_E of the expert's feature expectations.
2: Let β = (1 + √(2 ln k / T))^{-1}.
3: Define G̃(i, µ) ≜ ((1 − γ)(µ(i) − µ̂_E(i)) + 2)/4, where µ ∈ R^k.
4: Initialize W^{(1)}(i) = 1 for i = 1, . . . , k.
5: for t = 1, . . . , T do
6:   Set w^{(t)}(i) = W^{(t)}(i) / Σ_i W^{(t)}(i) for i = 1, . . . , k.
7:   Compute an ǫ_P-optimal policy π̂^{(t)} for M with respect to the reward function R(s) = w^{(t)} · φ(s).
8:   Compute an ǫ_F-good estimate µ̂^{(t)} of µ^{(t)} = µ(π̂^{(t)}).
9:   W^{(t+1)}(i) = W^{(t)}(i) · exp(ln(β) · G̃(i, µ̂^{(t)})) for i = 1, . . . , k.
10: end for
11: Post-processing: Return the mixed policy ψ that assigns probability 1/T to π̂^{(t)}, for all t ∈ {1, . . . , T}.
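As a concrete illustration, the loop above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the planner and feature-expectation estimator of lines 7–8 are abstracted into a single callback `estimate_mu`, which is assumed to return (an estimate of) the feature expectations of a near-optimal policy for the reward w · φ.

```python
import numpy as np

def mwal(mu_hat_E, estimate_mu, T, gamma):
    """Sketch of Algorithm 1 (MWAL).

    mu_hat_E:    estimate of the expert's feature expectations (length k).
    estimate_mu: stands in for lines 7-8; maps a weight vector w to the
                 (estimated) feature expectations of an eps_P-optimal
                 policy for the reward R(s) = w . phi(s).
    Returns the per-iteration feature expectations; the returned mixed
    policy would place probability 1/T on each corresponding policy.
    """
    k = len(mu_hat_E)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(k) / T))       # line 2
    W = np.ones(k)                                           # line 4
    mus = []
    for _ in range(T):                                       # line 5
        w = W / W.sum()                                      # line 6
        mu = estimate_mu(w)                                  # lines 7-8
        G = ((1.0 - gamma) * (mu - mu_hat_E) + 2.0) / 4.0    # line 3
        W = W * np.exp(np.log(beta) * G)                     # line 9
        mus.append(mu)
    return mus
```

Since ln β < 0, the update shrinks the weight of a feature more when the current policy already beats the expert on it, so w^{(t)} concentrates on the features where the policies do worst relative to the expert, which is exactly the adversarial behavior the game formulation calls for.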

1.1 Differences between G and G̃

In the main paper, Algorithm 1 was motivated by appealing to the game matrix G(i, j) = µ^j(i) − µ_E(i), where µ^j are the feature expectations of the jth deterministic policy. However, the algorithm actually uses

G̃(i, µ) = ((1 − γ)(µ(i) − µ̂_E(i)) + 2)/4.

The rationale behind each of the differences between G and G̃ follows.

• G̃ depends on µ̂_E instead of µ_E because µ_E is unknown and must be estimated. We account for the error of this estimate in the proof of Theorem 2.

• G̃ is defined in terms of arbitrary feature expectations µ instead of µ^j because lines 7 and 8 of Algorithm 1 produce approximations, and hence µ̂^{(t)} may not be the feature expectations of any deterministic policy. The results of Freund and Schapire [2] that we rely on are not affected by this change.

• G̃ is shifted and scaled so that G̃(i, µ) ∈ [0, 1]. This is necessary in order to directly apply the main result of Freund and Schapire [2].

The last point relies on a simplifying assumption. Recall that if µ is a vector of feature expectations for some policy, then µ ∈ [−1/(1 − γ), 1/(1 − γ)]^k, because φ(s) ∈ [−1, 1]^k for all s. For simplicity, we will assume that this holds even if µ is an estimate of a vector of feature expectations. (This is without loss of generality: if it does not hold, we can trim µ so that it falls within the desired range without increasing the error in the estimate.) Therefore µ(i) − µ̂_E(i) ∈ [−2/(1 − γ), 2/(1 − γ)], and hence G̃(i, µ) ∈ [0, 1].
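In code, the trimming step and the resulting range guarantee look like this (a small sketch, under the assumption φ(s) ∈ [−1, 1]^k as in Lemma 3, so that feature expectations lie in [−1/(1−γ), 1/(1−γ)]^k):

```python
import numpy as np

def G_tilde(mu, mu_hat_E, gamma):
    """Entrywise game value G~(i, mu), with the estimate mu first
    trimmed into the feasible range [-1/(1-gamma), 1/(1-gamma)]^k.
    Trimming can only move mu closer to the true feature expectations,
    so it does not increase the estimation error."""
    bound = 1.0 / (1.0 - gamma)
    mu = np.clip(mu, -bound, bound)
    return ((1.0 - gamma) * (mu - mu_hat_E) + 2.0) / 4.0
```

With both arguments in range, the difference lies in [−2/(1−γ), 2/(1−γ)]^k, so every entry of the output lands in [0, 1].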

2 Proof of Theorem 2

In this section we prove Theorem 2 from the main paper.

Theorem 2. Given an MDP\R M and m independent trajectories from an expert's policy π_E, suppose we execute the MWAL algorithm for T iterations. Let ψ be the mixed policy returned by the algorithm. Let ǫ_P and ǫ_F be the approximation errors from lines 7 and 8 of the algorithm, respectively. Let H ≥ (1/(1 − γ)) ln(1/(ǫ_H(1 − γ))) be the length of each sample trajectory. Let ǫ_R = min_{w∈S_k} max_s |R*(s) − w · φ(s)| be the representation error of the features. Let v* = max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ) − w · µ_E] be the game value. Then in order for

V(ψ) ≥ V(π_E) + v* − ǫ    (1)

to hold with probability at least 1 − δ, it suffices that

T ≥ (9 ln k) / (2(ǫ′(1 − γ))²)    (2)

m ≥ (2 / (ǫ′(1 − γ))²) ln(2k/δ)    (3)

where

ǫ′ ≤ (ǫ − (2ǫ_F + ǫ_P + 2ǫ_H + 2ǫ_R/(1 − γ))) / 3.    (4)
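Numerically, the iteration and sample bounds of Theorem 2 can be read off as below. This is a sketch using the constants as stated in the theorem; note that both quantities grow only logarithmically in k but quadratically in 1/(ǫ′(1 − γ)).

```python
import math

def sufficient_T(k, eps_prime, gamma):
    # Iteration bound: T >= 9 ln k / (2 (eps'(1 - gamma))^2).
    return math.ceil(9.0 * math.log(k) / (2.0 * (eps_prime * (1.0 - gamma)) ** 2))

def sufficient_m(k, eps_prime, gamma, delta):
    # Trajectory bound: m >= (2 / (eps'(1 - gamma))^2) ln(2k / delta).
    return math.ceil(2.0 / (eps_prime * (1.0 - gamma)) ** 2 * math.log(2.0 * k / delta))
```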

To prove Theorem 2, we will first need to prove several auxiliary results. Define

G̃(w, µ) ≜ Σ_{i=1}^k w(i) · G̃(i, µ).

Now we can directly apply the main result from Freund and Schapire [2], which we will call the MW Theorem.

MW Theorem. At the end of the MWAL algorithm,

(1/T) Σ_{t=1}^T G̃(w^{(t)}, µ̂^{(t)}) ≤ min_{w∈S_k} (1/T) Σ_{t=1}^T G̃(w, µ̂^{(t)}) + ∆_T

where

∆_T = √(2 ln k / T) + (ln k)/T.

Proof. Freund and Schapire [2].

The following corollary follows straightforwardly from the MW Theorem.

Corollary 1. At the end of the MWAL algorithm,

(1/T) Σ_{t=1}^T [w^{(t)} · µ̂^{(t)} − w^{(t)} · µ̂_E] ≤ min_{w∈S_k} (1/T) Σ_{t=1}^T [w · µ̂^{(t)} − w · µ̂_E] + ∆_T


The next lemma bounds the number of samples needed to make µ̂_E close to µ_E.

Lemma 1. Suppose the trajectory length H ≥ (1/(1 − γ)) ln(1/(ǫ_H(1 − γ))). For ‖µ̂_E − µ_E‖_∞ ≤ ǫ + ǫ_H to hold with probability at least 1 − δ, it suffices that

m ≥ (2/(ǫ(1 − γ))²) ln(2k/δ).

Proof. This is a standard proof using Hoeffding's inequality, similar to that found in Abbeel and Ng [1]. However, care must be taken in one respect: µ̂_E is not an unbiased estimate of µ_E, because the trajectories are truncated at H. So define

µ_E^H ≜ E[Σ_{t=0}^H γ^t φ(s_t) | π_E, θ, D].

Then we have

Pr(|µ̂_E(i) − µ_E^H(i)| ≥ ǫ) ≤ 2 exp(−m(ǫ(1 − γ))²/2)   for all i ∈ [1, . . . , k]

⇒ Pr(∃ i ∈ [1, . . . , k] s.t. |µ̂_E(i) − µ_E^H(i)| ≥ ǫ) ≤ 2k exp(−m(ǫ(1 − γ))²/2)

⇒ Pr(∀ i ∈ [1, . . . , k], |µ̂_E(i) − µ_E^H(i)| ≤ ǫ) ≥ 1 − 2k exp(−m(ǫ(1 − γ))²/2)

⇒ Pr(‖µ̂_E − µ_E^H‖_∞ ≤ ǫ) ≥ 1 − 2k exp(−m(ǫ(1 − γ))²/2)

We used, in order: Hoeffding's inequality and µ_E^H ∈ [−1/(1 − γ), 1/(1 − γ)]^k; the union bound; the probability of disjoint events; the definition of the L_∞ norm.

It is not hard to show that ‖µ_E^H − µ_E‖_∞ ≤ ǫ_H (see Kearns and Singh [4], Lemma 2). Hence if m ≥ (2/(ǫ(1 − γ))²) ln(2k/δ), then with probability at least 1 − δ we have

‖µ̂_E − µ_E‖_∞ ≤ ‖µ̂_E − µ_E^H‖_∞ + ‖µ_E^H − µ_E‖_∞ ≤ ǫ + ǫ_H.
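The truncation length H in Lemma 1 is chosen exactly so that the discarded tail has discount weight Σ_{t>H} γ^t = γ^{H+1}/(1 − γ) ≤ ǫ_H. A small sketch verifying this choice:

```python
import math

def truncation_length(gamma, eps_H):
    # H >= (1/(1 - gamma)) ln(1/(eps_H (1 - gamma))), as in Lemma 1.
    return math.ceil(math.log(1.0 / (eps_H * (1.0 - gamma))) / (1.0 - gamma))

def tail_weight(gamma, H):
    # Total discount weight beyond step H: sum_{t > H} gamma^t.
    return gamma ** (H + 1) / (1.0 - gamma)
```

The guarantee holds for every γ ∈ (0, 1) because ln(1/γ) ≥ 1 − γ, so H(1 − γ) ≥ ln(1/(ǫ_H(1 − γ))) implies γ^{H+1} ≤ ǫ_H(1 − γ).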

The next lemma bounds the impact of "representation error": it says that if R*(s) and w* · φ(s) are not very different, then neither are V(ψ) and w* · µ(ψ).

Lemma 2. If max_s |R*(s) − w* · φ(s)| ≤ ǫ_R, then |V(ψ) − w* · µ(ψ)| ≤ ǫ_R/(1 − γ) for every MDP\R M and mixed policy ψ.

Proof.

|V(ψ) − w* · µ(ψ)|
= |E[Σ_{t=0}^∞ γ^t R*(s_t)] − E[Σ_{t=0}^∞ γ^t w* · φ(s_t)]|
= |lim_{H→∞} E[Σ_{t=0}^H γ^t R*(s_t)] − lim_{H→∞} E[Σ_{t=0}^H γ^t w* · φ(s_t)]|
= |lim_{H→∞} E[Σ_{t=0}^H γ^t (R*(s_t) − w* · φ(s_t))]|
≤ lim_{H→∞} E[Σ_{t=0}^H γ^t |R*(s_t) − w* · φ(s_t)|]
≤ ǫ_R/(1 − γ)

We are now ready to prove Theorem 2. The proof closely follows Section 2.5 of Freund and Schapire [2].

Proof of Theorem 2. Let w̄ = (1/T) Σ_{t=1}^T w^{(t)}. Then we have

v* = max_{ψ∈Ψ} min_{w∈S_k} [w · µ(ψ) − w · µ_E]
= min_{w∈S_k} max_{ψ∈Ψ} [w · µ(ψ) − w · µ_E]    (6)
≤ min_{w∈S_k} max_{ψ∈Ψ} [w · µ(ψ) − w · µ̂_E] + ǫ′ + ǫ_H    (7)
≤ max_{ψ∈Ψ} [w̄ · µ(ψ) − w̄ · µ̂_E] + ǫ′ + ǫ_H    (8)
= max_{ψ∈Ψ} (1/T) Σ_{t=1}^T [w^{(t)} · µ(ψ) − w^{(t)} · µ̂_E] + ǫ′ + ǫ_H
≤ (1/T) Σ_{t=1}^T max_{ψ∈Ψ} [w^{(t)} · µ(ψ) − w^{(t)} · µ̂_E] + ǫ′ + ǫ_H
≤ (1/T) Σ_{t=1}^T [w^{(t)} · µ(π̂^{(t)}) − w^{(t)} · µ̂_E] + ǫ_P + ǫ′ + ǫ_H    (9)
≤ (1/T) Σ_{t=1}^T [w^{(t)} · µ̂^{(t)} − w^{(t)} · µ̂_E] + ǫ_F + ǫ_P + ǫ′ + ǫ_H    (10)
≤ min_{w∈S_k} (1/T) Σ_{t=1}^T [w · µ̂^{(t)} − w · µ̂_E] + ∆_T + ǫ_F + ǫ_P + ǫ′ + ǫ_H    (11)
≤ min_{w∈S_k} (1/T) Σ_{t=1}^T [w · µ(π̂^{(t)}) − w · µ̂_E] + ∆_T + 2ǫ_F + ǫ_P + ǫ′ + ǫ_H    (12)
= min_{w∈S_k} [w · µ(ψ) − w · µ̂_E] + ∆_T + 2ǫ_F + ǫ_P + ǫ′ + ǫ_H    (13)
≤ min_{w∈S_k} [w · µ(ψ) − w · µ_E] + ∆_T + 2ǫ_F + ǫ_P + 2ǫ′ + 2ǫ_H    (14)
≤ w* · µ(ψ) − w* · µ_E + ∆_T + 2ǫ_F + ǫ_P + 2ǫ′ + 2ǫ_H    (15)
≤ V(ψ) − V(π_E) + ∆_T + 2ǫ_F + ǫ_P + 2ǫ′ + 2ǫ_H + 2ǫ_R/(1 − γ)    (16)

In (6), we used von Neumann's minmax theorem. In (7), Lemma 1. In (8), the definition of w̄. In (9), the fact that π̂^{(t)} is ǫ_P-optimal w.r.t. R(s) = w^{(t)} · φ(s). In (10), the fact that µ̂^{(t)} is an ǫ_F-good estimate of µ(π̂^{(t)}). In (11), Corollary 1. In (12), again the fact that µ̂^{(t)} is an ǫ_F-good estimate of µ(π̂^{(t)}). In (13), the definition of ψ. In (14), Lemma 1. In (15), we let w* = arg min_{w∈S_k} max_s |R*(s) − w · φ(s)|. In (16), Lemma 2. Plugging the choice for T into ∆_T and rearranging implies the theorem.

3 When the Transition Function is Unknown

We will employ several technical lemmas developed in Kearns and Singh [4] and Abbeel and Ng [5]. This is not a complete proof, but rather a sketch of its main components.

For an MDP\R M = (S, A, γ, θ, φ), suppose that we know θ(s, a, ·) exactly on a subset Z ⊆ S × A. Then we can construct an estimate M_Z of M according to the following definition, which is similar to Definition 9 in Kearns and Singh [4].

Definition 1. Let M = (S, A, γ, θ, φ) be an MDP\R, and let Z ⊆ S × A. Then the induced MDP\R M_Z = (S ∪ {s_0}, A, γ, θ_Z, φ_Z) is defined as follows, where S_Z = {s : (s, a) ∈ Z for some a ∈ A}:

• θ_Z(s_0, a, s_0) = 1 for all a ∈ A, i.e. s_0 is an absorbing state.
• If (s, a) ∈ Z and s′ ∈ S_Z, then θ_Z(s, a, s′) = θ(s, a, s′).
• If (s, a) ∈ Z, then θ_Z(s, a, s_0) = 1 − Σ_{s′∈S_Z} θ(s, a, s′).
• If (s, a) ∉ Z, then θ_Z(s, a, s_0) = 1.
• φ_Z(s) = φ(s) for all s ∈ S, and φ_Z(s_0) = −1, where −1 is the k-length vector of all −1's.

The following lemma, due to Kearns and Singh [4] (Lemma 7), shows that M_Z is essentially a pessimistic estimate of M.

Lemma 3. Let M = (S, A, γ, θ, φ) be an MDP\R where φ(s) ∈ [−1, 1]^k, and let Z ⊆ S × A. Then for all w ∈ S_k and ψ ∈ Ψ, we have w · µ(ψ, M) ≥ w · µ(ψ, M_Z).

Proof. As above, let S_Z = {s : (s, a) ∈ Z for some a ∈ A}. Also let A_Z = {a : (s, a) ∈ Z for some s ∈ S}. All transitions in M_Z between states in S_Z using an action in A_Z are the same as in M, while all other transitions are routed to the absorbing state s_0. Observing that φ_Z(s_0) = −1 and φ(s) ⪰ −1 for all s proves the lemma.

Definition 2. Let M = (S, A, γ, θ, φ) be an MDP\R. Let H be the length of each sample trajectory from the expert's policy. Then we say a subset Z ⊆ S × A is (η, H)-visited by π_E in M if

Z = {(s, a) : Pr(∃ t ∈ [1, . . . , H] such that (s_t, a_t) = (s, a) | π_E, M) ≥ η/(|S||A|)}.    (17)
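Definition 1's construction is mechanical and can be sketched directly (a hypothetical array-based encoding; the extra state index S plays the role of the absorbing state s_0):

```python
import numpy as np

def induced_mdp(theta, phi, Z):
    """Build (theta_Z, phi_Z) for the induced MDP\\R M_Z of Definition 1.

    theta: shape (S, A, S) transition probabilities.
    phi:   shape (S, k) features with entries in [-1, 1].
    Z:     set of (s, a) pairs on which theta is known exactly.
    """
    S, A, _ = theta.shape
    S_Z = {s for (s, a) in Z}
    theta_Z = np.zeros((S + 1, A, S + 1))
    theta_Z[S, :, S] = 1.0                       # s0 is absorbing
    for s in range(S):
        for a in range(A):
            if (s, a) in Z:
                for s2 in S_Z:                   # keep transitions into S_Z
                    theta_Z[s, a, s2] = theta[s, a, s2]
                theta_Z[s, a, S] = 1.0 - theta_Z[s, a, :S].sum()
            else:
                theta_Z[s, a, S] = 1.0           # unknown pairs go to s0
    phi_Z = np.vstack([phi, -np.ones((1, phi.shape[1]))])   # phi_Z(s0) = -1
    return theta_Z, phi_Z
```

Because every unknown or out-of-Z transition is redirected to s_0, whose feature vector is the all-(−1) vector, M_Z can only undervalue a policy relative to M, which is the content of Lemma 3.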

The following lemma, due to Abbeel and Ng [5], says that if Z ⊆ S × A is (η, H)-visited by π_E in M, then π_E has a similar value in M_Z as it does in M.

Lemma 4. Let M = (S, A, γ, θ, φ) be an MDP\R, let H ≥ (1/(1 − γ)) ln(1/(ǫ_H(1 − γ))), and let Z ⊆ S × A be (η, H)-visited by π_E in M. Then for all w ∈ S_k,

|w · µ(π_E, M) − w · µ(π_E, M_Z)| ≤ η/(1 − γ) + ǫ_H.    (18)

Proof. By the definition of M_Z and the union bound, we have

Pr({(s_t, a_t)}_{t=1}^H ⊆ Z | π_E, M_Z) = Pr({(s_t, a_t)}_{t=1}^H ⊆ Z | π_E, M) ≥ 1 − η.    (19)

Now suppose w · µ(π_E, M) ≥ w · µ(π_E, M_Z). Then

|w · µ(π_E, M) − w · µ(π_E, M_Z)|
= |E[Σ_{t=0}^H γ^t w · φ(s_t) | π_E, M] + E[Σ_{t=H+1}^∞ γ^t w · φ(s_t) | π_E, M]    (20)
  − E[Σ_{t=0}^H γ^t w · φ(s_t) | π_E, M_Z] − E[Σ_{t=H+1}^∞ γ^t w · φ(s_t) | π_E, M_Z]|    (21)
≤ η (1 − γ^{H+1})/(1 − γ) + γ^{H+1}/(1 − γ)    (22)
≤ η/(1 − γ) + ǫ_H    (23)

A parallel argument can be made in the case w · µ(π_E, M) ≤ w · µ(π_E, M_Z).

Since we will not know M_Z exactly, we will need to estimate it. The following lemma, due to Abbeel and Ng [5] (Lemma 14), says that if two MDP\R's M and M̂ do not differ much, then the value of the same policy in M and M̂ is not very different.

Lemma 5. Let M = (S, A, γ, θ, φ) and M̂ = (S, A, γ, θ̂, φ) be two MDP\R's that differ only in their transition functions. Suppose θ and θ̂ satisfy

∀ s ∈ S, a ∈ A:  ‖θ(s, a, ·) − θ̂(s, a, ·)‖_1 ≤ ǫ.    (24)

Then for all ψ ∈ Ψ and w ∈ S_k, we have

|w · µ(ψ, M) − w · µ(ψ, M̂)| ≤ 2ǫ/(1 − γ)².    (25)

The following lemma, due to Abbeel and Ng [5] (Lemma 17), bounds the number of trajectories needed from π_E to make θ and θ̂ similar on a subset Z ⊆ S × A that is (η, H)-visited by π_E.

Lemma 6. Let M = (S, A, γ, θ, φ). Let Z ⊆ S × A be (ǫ, H)-visited by π_E in M. Let θ̂ be the MLE for θ formed by observing m independent trajectories from π_E. Also, let K(s, a) denote the actual number of times (s, a) is visited in the m trajectories. Then for

∀ (s, a) ∈ Z:  K(s, a) ≥ (|S|²/(4ǫ²)) ln(|S|³|A|/ǫ)    (26)

∀ (s, a) ∈ Z:  ‖θ(s, a, ·) − θ̂(s, a, ·)‖_1 ≤ ǫ    (27)

to hold with probability 1 − δ, it suffices that

m ≥ (|S|³|A|/(8ǫ³)) ln(|S|³|A|/(δǫ)) + |S||A| ln(2|S||A|/δ).    (28)

3.1 Putting it all together

Here is the algorithm:

1. Collect m ≥ (|S|³|A|/(8ǫ³)) ln(|S|³|A|/(δǫ)) + |S||A| ln(2|S||A|/δ) sample trajectories from the expert.
2. Define the following:
   (a) Let Z be the set of all state-action pairs (s, a) such that K(s, a) ≥ (|S|²/(4ǫ²)) ln(|S|³|A|/ǫ).
   (b) Let θ̂ be the MLE for θ.
   (c) Let M = (S, A, γ, θ, φ) and M̂ = (S, A, γ, θ̂, φ).
3. Submit M̂_Z and µ̂_E to the MWAL algorithm, which returns ψ.
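Step 2 above reduces to simple counting. A sketch (with a hypothetical trajectory encoding as lists of (s, a, s′) transitions; `K_min` stands in for the threshold (|S|²/(4ǫ²)) ln(|S|³|A|/ǫ)):

```python
import numpy as np

def build_Z_and_mle(trajectories, S, A, K_min):
    """Steps 2(a)-(b): keep the state-action pairs visited at least
    K_min times, and form the MLE theta_hat of the transition function."""
    K = np.zeros((S, A), dtype=int)
    counts = np.zeros((S, A, S))
    for traj in trajectories:
        for (s, a, s2) in traj:
            K[s, a] += 1
            counts[s, a, s2] += 1
    Z = {(s, a) for s in range(S) for a in range(A) if K[s, a] >= K_min}
    theta_hat = np.zeros((S, A, S))
    visited = K > 0
    theta_hat[visited] = counts[visited] / K[visited][:, None]
    return Z, theta_hat
```

The pair (Z, θ̂) is exactly what is needed to form the induced MDP\R M̂_Z that is handed to MWAL in step 3.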

Lemma 3 shows that V(ψ, M) is at least V(ψ, M_Z). Lemma 5 says that V(ψ, M_Z) is close to V(ψ, M̂_Z). Since M̂_Z is the MDP\R that we gave to the MWAL algorithm, Theorem 2 says that V(ψ, M̂_Z) is at least V(π_E, M̂_Z), up to the error terms in the theorem. Lemma 5 says that V(π_E, M̂_Z) is close to V(π_E, M_Z). Lemma 4 says that V(π_E, M_Z) is close to V(π_E, M).

References

[1] P. Abbeel, A. Ng (2004). Apprenticeship Learning via Inverse Reinforcement Learning. ICML 21.
[2] Y. Freund, R. E. Schapire (1996). Game Theory, On-line Prediction and Boosting. COLT 9.
[3] Y. Freund, R. E. Schapire (1999). Adaptive Game Playing Using Multiplicative Weights. Games and Economic Behavior 29, 79–103.
[4] M. Kearns, S. Singh (2002). Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning 49, 209–232.
[5] P. Abbeel, A. Ng (2005). Exploration and Apprenticeship Learning in Reinforcement Learning. ICML 22. (Long version available at http://www.cs.stanford.edu/~pabbeel/)
