Robustness of Bayesian Pool-based Active Learning Against Prior Misspecification: Supplementary Material

Nguyen Viet Cuong¹, Nan Ye², Wee Sun Lee³

¹ Department of Mechanical Engineering, National University of Singapore, Singapore, [email protected]
² Mathematical Sciences School & ACEMS, Queensland University of Technology, Australia, [email protected]
³ Department of Computer Science, National University of Singapore, Singapore, [email protected]

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Proof of Corollary 1

Cuong et al. (2013) showed that the maximum Gibbs error algorithm provides a constant factor approximation to the optimal policy Gibbs error, which is equivalent to the expected version space reduction $f_p^{\mathrm{avg}}(\pi)$. Formally, they showed that, for any prior $p$,
\[
f_p^{\mathrm{avg}}(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_{\pi} f_p^{\mathrm{avg}}(\pi),
\]
where $A$ is the maximum Gibbs error algorithm. That is, the algorithm is average-case $(1 - 1/e)$-approximate. Furthermore, the version space reduction utility is upper bounded by $M = 1$; and for any priors $p, p'$, we also have
\[
\begin{aligned}
|f_p(S, h) - f_{p'}(S, h)| &= |p'[h(S); S] - p[h(S); S]| \\
&= \Big| \sum_{h'} p'[h'] \, P[h'(S) = h(S) \mid h'] - \sum_{h'} p[h'] \, P[h'(S) = h(S) \mid h'] \Big| \\
&\le \|p - p'\|.
\end{aligned}
\]
Thus, the version space reduction utility is Lipschitz continuous with $L = 1$ and is upper bounded by $M = 1$. Hence, Corollary 1 follows from Theorem 1.

Proof of Theorem 2

Let $\pi_0 = \arg\max_{\pi} f_{p_0}^{\mathrm{worst}}(\pi)$ and $\pi_1 = \arg\max_{\pi} f_{p_1}^{\mathrm{worst}}(\pi)$. We have $f_{p_1}^{\mathrm{worst}}(\pi_1) \ge f_{p_1}^{\mathrm{worst}}(\pi_0) = f_{p_1}(x_{h_0}^{\pi_0}, h_0)$, where $h_0 = \arg\min_h f_{p_1}(x_h^{\pi_0}, h)$. Using the Lipschitz continuity of $f_p$ and the definition of $f_{p_0}^{\mathrm{worst}}$, we have
\[
\begin{aligned}
f_{p_1}(x_{h_0}^{\pi_0}, h_0) &\ge f_{p_0}(x_{h_0}^{\pi_0}, h_0) - L\|p_0 - p_1\| \\
&\ge \min_h f_{p_0}(x_h^{\pi_0}, h) - L\|p_0 - p_1\| \\
&= f_{p_0}^{\mathrm{worst}}(\pi_0) - L\|p_0 - p_1\|.
\end{aligned}
\]
Thus, $f_{p_1}^{\mathrm{worst}}(\pi_1) \ge f_{p_0}^{\mathrm{worst}}(\pi_0) - L\|p_0 - p_1\|$.

Let $\pi = A(p_1)$ and $h^* = \arg\min_h f_{p_0}(x_h^{\pi}, h)$. We have $f_{p_0}^{\mathrm{worst}}(\pi) = \min_h f_{p_0}(x_h^{\pi}, h) = f_{p_0}(x_{h^*}^{\pi}, h^*)$. By the Lipschitz continuity of $f_p$, we have
\[
\begin{aligned}
f_{p_0}(x_{h^*}^{\pi}, h^*) &\ge f_{p_1}(x_{h^*}^{\pi}, h^*) - L\|p_0 - p_1\| \\
&\ge \min_h f_{p_1}(x_h^{\pi}, h) - L\|p_0 - p_1\| \\
&= f_{p_1}^{\mathrm{worst}}(\pi) - L\|p_0 - p_1\| \\
&\ge \alpha \max_{\pi'} f_{p_1}^{\mathrm{worst}}(\pi') - L\|p_0 - p_1\| \\
&= \alpha f_{p_1}^{\mathrm{worst}}(\pi_1) - L\|p_0 - p_1\|,
\end{aligned}
\]
where the last inequality holds as $A$ is $\alpha$-approximate. Using the inequality relating $f_{p_1}^{\mathrm{worst}}(\pi_1)$ and $f_{p_0}^{\mathrm{worst}}(\pi_0)$ above, we now have
\[
\begin{aligned}
f_{p_0}^{\mathrm{worst}}(\pi) &\ge \alpha \left( f_{p_0}^{\mathrm{worst}}(\pi_0) - L\|p_0 - p_1\| \right) - L\|p_0 - p_1\| \\
&= \alpha \max_{\pi'} f_{p_0}^{\mathrm{worst}}(\pi') - (\alpha + 1) L \|p_0 - p_1\|.
\end{aligned}
\]

Proof of Corollary 3

Cuong, Lee, and Ye (2014) have shown that the least confidence algorithm achieves a constant factor approximation to the optimal worst-case version space reduction. Formally, if $f_p(S, h)$ is the version space reduction utility (that was considered previously for the maximum Gibbs error algorithm), then $f_p^{\mathrm{worst}}(\pi)$ is the worst-case version space reduction of $\pi$, and it was shown (Cuong, Lee, and Ye 2014) that, for any prior $p$,
\[
f_p^{\mathrm{worst}}(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_{\pi} f_p^{\mathrm{worst}}(\pi),
\]
where $A$ is the least confidence algorithm. That is, the least confidence algorithm is worst-case $(1 - 1/e)$-approximate. Since the version space reduction utility is Lipschitz continuous with $L = 1$ as shown in the proof of Corollary 1, Corollary 3 follows from Theorem 2.
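The Lipschitz bound with $L = 1$ for the version space reduction utility, used in the proofs of Corollaries 1 and 3, can be sanity-checked numerically. The following is a minimal sketch and not part of the original proofs. It assumes deterministic hypotheses, so that $P[h'(S) = h(S) \mid h']$ is either 0 or 1, and it takes $\|p - p'\|$ to be the $\ell_1$ distance, which is an assumption here. The label table `LABELS` and all function names are hypothetical.

```python
import itertools
import random


def version_space_reduction(prior, labels, S, h):
    """f_p(S, h) = 1 - p[h(S); S] for deterministic hypotheses (assumption).

    labels[g][x] is the label that hypothesis g assigns to example x.
    """
    agreeing_mass = sum(
        prior[g]
        for g in range(len(prior))
        if all(labels[g][x] == labels[h][x] for x in S)
    )
    return 1.0 - agreeing_mass


def l1_distance(p, q):
    # Assumed norm: sum of absolute differences between the two priors.
    return sum(abs(a - b) for a, b in zip(p, q))


def random_prior(n, rng):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def max_lipschitz_ratio(labels, trials=200, seed=0):
    """Largest observed |f_p - f_p'| / ||p - p'||_1 over random prior pairs."""
    rng = random.Random(seed)
    n_hyp, n_ex = len(labels), len(labels[0])
    subsets = [
        S
        for r in range(n_ex + 1)
        for S in itertools.combinations(range(n_ex), r)
    ]
    worst = 0.0
    for _ in range(trials):
        p, q = random_prior(n_hyp, rng), random_prior(n_hyp, rng)
        dist = l1_distance(p, q)
        if dist == 0.0:
            continue  # identical priors carry no information about the ratio
        for S in subsets:
            for h in range(n_hyp):
                gap = abs(
                    version_space_reduction(p, labels, S, h)
                    - version_space_reduction(q, labels, S, h)
                )
                worst = max(worst, gap / dist)
    return worst


# Hypothetical toy class: 4 hypotheses labelling 2 examples.
LABELS = [[0, 0], [1, 0], [0, 1], [1, 1]]
ratio = max_lipschitz_ratio(LABELS)
print("bound L = 1 respected:", ratio <= 1.0)
```

A random search like this can only fail to find a violation; it does not replace the proof. It is still a cheap way to catch a sign or indexing mistake when implementing the utility.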

Proof of Corollary 4

It was shown by Cuong, Lee, and Ye (2014) that, for any prior $p$,
\[
t_p^{\mathrm{worst}}(A(p)) \ge \left(1 - \frac{1}{e}\right) \max_{\pi} t_p^{\mathrm{worst}}(\pi),
\]
where $A$ is the worst-case generalized Gibbs error algorithm. That is, the worst-case generalized Gibbs error algorithm is worst-case $(1 - 1/e)$-approximate. If we assume the loss function $L$ is upper bounded by a constant $m$, then $t_p$ is Lipschitz continuous with $L = 2m$. Indeed, for any $S$, $h$, $p$, and $p'$, we have
\[
\begin{aligned}
|t_p(S, h) - t_{p'}(S, h)|
&= \Big| \sum_{\substack{h'(S) \ne h(S) \text{ or} \\ h''(S) \ne h(S)}} L(h', h'') \left( p[h'] p[h''] - p'[h'] p'[h''] \right) \Big| \\
&\le m \sum_{\substack{h'(S) \ne h(S) \text{ or} \\ h''(S) \ne h(S)}} \left| p[h'] p[h''] - p'[h'] p'[h''] \right| \\
&= m \sum_{\substack{h'(S) \ne h(S) \text{ or} \\ h''(S) \ne h(S)}} \left| (p[h'] - p'[h']) p[h''] + p'[h'] (p[h''] - p'[h'']) \right| \\
&\le m \sum_{h', h''} \left( |p[h'] - p'[h']| \, p[h''] + p'[h'] \, |p[h''] - p'[h'']| \right) \\
&= 2m \|p - p'\|.
\end{aligned}
\]
Thus, Corollary 4 follows from Theorem 2.

Proof of Theorem 3

For both the average and worst cases, consider the AL problem with budget $k = 1$ and the utility
\[
f_p(S, h) = |\{ h' : p[h'] > \mu \text{ and } h'(S) \ne h(S) \}|,
\]
for some very small $\mu > 0$ in the worst case and $\mu = 0$ in the average case. This utility returns the number of hypotheses that have a significant probability (greater than $\mu$) and are not consistent with $h$ on $S$. When $\mu = 0$, it is the number of hypotheses pruned from the version space. So, this is a reasonable utility to maximize for AL. It is easy to see that this utility is non-Lipschitz.

Consider the case where there are two examples $x_0, x_1$ and 4 hypotheses $h_1, \ldots, h_4$ with binary labels given according to the following table.
\[
\begin{array}{c|cc}
\text{Hypothesis} & x_0 & x_1 \\
\hline
h_1 & 0 & 0 \\
h_2 & 1 & 0 \\
h_3 & 0 & 1 \\
h_4 & 1 & 1
\end{array}
\]
Consider the true prior $p_0$ where $p_0[h_1] = p_0[h_2] = \frac{1}{2} - \mu$ and $p_0[h_3] = p_0[h_4] = \mu$, and a perturbed prior $p_1$ where $p_1[h_1] = p_1[h_2] = \frac{1}{2} - \mu - \delta$ and $p_1[h_3] = p_1[h_4] = \mu + \delta$, for some small $\delta > 0$. With budget $k = 1$, there are two possible policies: the policy $\pi_0$ which chooses $x_0$ and the policy $\pi_1$ which chooses $x_1$. Let $A^*(p_1) = \pi_1$. Note that
\[
f_{p_1}^{\mathrm{avg}}(\pi_1) = 2\left(\tfrac{1}{2} - \mu - \delta\right) + 2\left(\tfrac{1}{2} - \mu - \delta\right) + 2(\mu + \delta) + 2(\mu + \delta) = 2,
\]
and
\[
f_{p_1}^{\mathrm{avg}}(\pi_0) = 2\left(\tfrac{1}{2} - \mu - \delta\right) + 2\left(\tfrac{1}{2} - \mu - \delta\right) + 2(\mu + \delta) + 2(\mu + \delta) = 2.
\]
Thus, $\pi_1$ is an average-case optimal policy for $p_1$ and $A^*$ is an exact algorithm for $p_1$ in the average case. Similarly, $f_{p_1}^{\mathrm{worst}}(\pi_1) = 2 = f_{p_1}^{\mathrm{worst}}(\pi_0)$. Thus, $\pi_1$ is a worst-case optimal policy for $p_1$ and $A^*$ is an exact algorithm for $p_1$ in the worst case. Hence, $A^*$ is an exact algorithm for $p_1$ in both average and worst cases.

Considering $p_0$, we have $f_{p_0}^{\mathrm{avg}}(\pi_1) = 0\left(\frac{1}{2} - \mu\right) + 0\left(\frac{1}{2} - \mu\right) + 2\mu + 2\mu = 0$ since $\mu = 0$ in the average case. On the other hand, $f_{p_0}^{\mathrm{avg}}(\pi_0) = 1\left(\frac{1}{2} - \mu\right) + 1\left(\frac{1}{2} - \mu\right) + 1\mu + 1\mu = 1$. Similarly, in the worst case, we also have $f_{p_0}^{\mathrm{worst}}(\pi_1) = 0$ and $f_{p_0}^{\mathrm{worst}}(\pi_0) = 1$. Thus, $\pi_0$ is the optimal policy for $p_0$ in both average and worst cases. Now given any $C, \alpha, \epsilon > 0$, we can choose a small enough $\delta$ such that $\|p_1 - p_0\| < \epsilon$ and $\alpha - C\|p_1 - p_0\| > 0$. Hence, Theorem 3 holds.

Proof of Theorem 4

For any policy $\pi$, note that
\[
\begin{aligned}
|c_{p_0}^{\mathrm{avg}}(\pi) - c_{p_1}^{\mathrm{avg}}(\pi)|
&= \Big| \sum_h p_0[h] c(\pi, h) - \sum_h p_1[h] c(\pi, h) \Big| \\
&= \Big| \sum_h (p_0[h] - p_1[h]) c(\pi, h) \Big| \\
&\le K \|p_0 - p_1\|,
\end{aligned}
\]
where the last inequality holds as $c(\pi, h)$ is upper bounded by $K$. Thus, $c_{p_0}^{\mathrm{avg}}(\pi) \le c_{p_1}^{\mathrm{avg}}(\pi) + K\|p_0 - p_1\|$, for all $\pi, p_0, p_1$. Let $\pi_0 = \arg\min_{\pi} c_{p_0}^{\mathrm{avg}}(\pi)$. We have
\[
\begin{aligned}
c_{p_0}^{\mathrm{avg}}(A(p_1)) &\le c_{p_1}^{\mathrm{avg}}(A(p_1)) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) \min_{\pi} c_{p_1}^{\mathrm{avg}}(\pi) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) c_{p_1}^{\mathrm{avg}}(\pi_0) + K\|p_0 - p_1\| \\
&\le \alpha(p_1) \left( c_{p_0}^{\mathrm{avg}}(\pi_0) + K\|p_0 - p_1\| \right) + K\|p_0 - p_1\| \\
&= \alpha(p_1) c_{p_0}^{\mathrm{avg}}(\pi_0) + (\alpha(p_1) + 1) K \|p_0 - p_1\| \\
&= \alpha(p_1) \min_{\pi} c_{p_0}^{\mathrm{avg}}(\pi) + (\alpha(p_1) + 1) K \|p_0 - p_1\|,
\end{aligned}
\]
where the first and fourth inequalities are from the discussion above, and the second inequality is from the fact that $A$ is $\alpha(p)$-approximate.

Proof of Theorem 5

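If $k = 1$, then $p_1 = p_0$ and Theorem 5 trivially holds. Consider $k \ge 2$. For any $h$, since
\[
\frac{p_0[h]}{p_1[h]} = \frac{p_0[h]}{\sum_{i=1}^{k} \frac{1}{k} p_{1,i}[h]} \le \frac{p_0[h]}{\frac{1}{k} p_0[h]} = k,
\]
we have $k - 1 \ge 1 - \frac{p_0[h]}{p_1[h]} \ge 1 - k$. Thus, $\left| 1 - \frac{p_0[h]}{p_1[h]} \right| \le k - 1$.

Hence, for any policy $\pi$,
\[
\begin{aligned}
|c_{p_1}^{\mathrm{avg}}(\pi) - c_{p_0}^{\mathrm{avg}}(\pi)|
&= \Big| \sum_h p_1[h] \left( 1 - \frac{p_0[h]}{p_1[h]} \right) c(\pi, h) \Big| \\
&\le (k - 1) \sum_h p_1[h] c(\pi, h) \\
&= (k - 1) c_{p_1}^{\mathrm{avg}}(\pi).
\end{aligned}
\]
Therefore, $c_{p_0}^{\mathrm{avg}}(\pi) \le k \, c_{p_1}^{\mathrm{avg}}(\pi)$.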
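The inequality $c_{p_0}^{\mathrm{avg}}(\pi) \le k \, c_{p_1}^{\mathrm{avg}}(\pi)$ can be sanity-checked on random instances. The sketch below is illustrative only and rests on stated assumptions: $p_1$ is the uniform mixture of $k$ component priors, one of which is $p_0$, and the costs $c(\pi, h)$ of a single fixed policy are non-negative numbers drawn at random. All function names are hypothetical.

```python
import random


def random_prior(n, rng):
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def avg_cost(prior, costs):
    """c_p^avg(pi) = sum_h p[h] * c(pi, h) for one fixed policy pi."""
    return sum(p * c for p, c in zip(prior, costs))


def check_mixture_bound(n_hyp=5, k=3, trials=1000, seed=0):
    """Return True if c_{p0}^avg <= k * c_{p1}^avg held in every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        p0 = random_prior(n_hyp, rng)
        # Uniform mixture of k priors, with p0 as one component (assumption).
        components = [p0] + [random_prior(n_hyp, rng) for _ in range(k - 1)]
        p1 = [sum(c[h] for c in components) / k for h in range(n_hyp)]
        # Non-negative costs c(pi, h) of a hypothetical fixed policy.
        costs = [rng.uniform(0.0, 10.0) for _ in range(n_hyp)]
        if avg_cost(p0, costs) > k * avg_cost(p1, costs) + 1e-9:
            return False
    return True


print(check_mixture_bound())
```

The check passes for a simple reason that mirrors the proof: every $p_1[h]$ is at least $p_0[h] / k$, so each term of $k \, c_{p_1}^{\mathrm{avg}}(\pi)$ dominates the matching term of $c_{p_0}^{\mathrm{avg}}(\pi)$ when costs are non-negative.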

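On the other hand, for any $h$, we have
\[
\frac{p_1[h]}{p_0[h]}
= \frac{\sum_{i=1}^{k} \frac{1}{k} p_{1,i}[h]}{p_0[h]}
= \frac{\frac{1}{k} p_0[h] + \sum_{i : p_{1,i} \ne p_0} \frac{1}{k} p_{1,i}[h]}{p_0[h]}
\le \frac{1}{k} + \frac{\frac{k-1}{k}}{\min_h p_0[h]}
= \frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]}.
\]
Thus, $1 - \frac{1}{k} \ge 1 - \frac{p_1[h]}{p_0[h]} \ge 1 - \frac{1}{k} - \frac{k - 1}{k \min_h p_0[h]}$. When $H$ contains at least 2 hypotheses, $\min_h p_0[h] \le 1/2$, and $\frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]} - 1 \ge 1 - \frac{1}{k}$ (the case when $H$ is a singleton is equivalent to $k = 1$). Hence,
\[
\left| 1 - \frac{p_1[h]}{p_0[h]} \right| \le \frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]} - 1.
\]
We have
\[
\begin{aligned}
|c_{p_0}^{\mathrm{avg}}(\pi) - c_{p_1}^{\mathrm{avg}}(\pi)|
&= \Big| \sum_h p_0[h] \left( 1 - \frac{p_1[h]}{p_0[h]} \right) c(\pi, h) \Big| \\
&\le \left( \frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]} - 1 \right) c_{p_0}^{\mathrm{avg}}(\pi).
\end{aligned}
\]
Therefore, $c_{p_1}^{\mathrm{avg}}(\pi) \le \left( \frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]} \right) c_{p_0}^{\mathrm{avg}}(\pi)$.

Now let $\pi_0 = \arg\min_{\pi} c_{p_0}^{\mathrm{avg}}(\pi)$. We have
\[
\begin{aligned}
c_{p_0}^{\mathrm{avg}}(A(p_1)) &\le k \, c_{p_1}^{\mathrm{avg}}(A(p_1)) \\
&\le k \alpha(p_1) \min_{\pi} c_{p_1}^{\mathrm{avg}}(\pi) \\
&\le k \alpha(p_1) c_{p_1}^{\mathrm{avg}}(\pi_0) \\
&\le k \alpha(p_1) \left( \frac{1}{k} + \frac{k - 1}{k \min_h p_0[h]} \right) c_{p_0}^{\mathrm{avg}}(\pi_0) \\
&= \alpha(p_1) \left( 1 + \frac{k - 1}{\min_h p_0[h]} \right) \min_{\pi} c_{p_0}^{\mathrm{avg}}(\pi), \qquad \text{(first part)}
\end{aligned}
\]
where the first and fourth inequalities are from the discussions above, and the second inequality is from the fact that $A$ is $\alpha(p)$-approximate.

If $A$ is the generalized binary search algorithm, then $\alpha(p_1) = \ln \frac{1}{\min_h p_1[h]} + 1$. Note that $\min_h p_1[h] = \min_h \sum_{i=1}^{k} \frac{1}{k} p_{1,i}[h] \ge \min_h \frac{1}{k} p_0[h]$. Thus, $\alpha(p_1) \le \ln \frac{k}{\min_h p_0[h]} + 1$. Therefore,
\[
c_{p_0}^{\mathrm{avg}}(A(p_1)) \le \left( \ln \frac{k}{\min_h p_0[h]} + 1 \right) \left( \frac{k - 1}{\min_h p_0[h]} + 1 \right) \min_{\pi} c_{p_0}^{\mathrm{avg}}(\pi).
\]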
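To get a feel for how the final factor for generalized binary search grows with the number of mixture components $k$, it can be evaluated directly. The sketch below is illustrative only: it computes $\left( \ln \frac{k}{\min_h p_0[h]} + 1 \right) \left( \frac{k - 1}{\min_h p_0[h]} + 1 \right)$ for a hypothetical uniform prior over four hypotheses. The function names and the example prior are assumptions and do not come from the paper.

```python
import math


def mixture_ratio_bound(p0, k):
    """(k - 1) / min_h p0[h] + 1, the factor that multiplies alpha(p1)."""
    return (k - 1) / min(p0) + 1.0


def gbs_alpha_bound(p0, k):
    """ln(k / min_h p0[h]) + 1, the upper bound on alpha(p1) for GBS."""
    return math.log(k / min(p0)) + 1.0


def gbs_mixture_factor(p0, k):
    """Product of the two bounds above; requires a prior with full support."""
    if k < 1 or min(p0) <= 0.0:
        raise ValueError("need k >= 1 and p0[h] > 0 for every hypothesis h")
    return gbs_alpha_bound(p0, k) * mixture_ratio_bound(p0, k)


# Hypothetical example: uniform true prior over 4 hypotheses.
uniform4 = [0.25] * 4
factors = {k: gbs_mixture_factor(uniform4, k) for k in (1, 2, 4, 8)}
for k, value in factors.items():
    print(f"k={k}: factor <= {value:.2f}")
```

For $k = 1$ the second factor equals 1 and the bound reduces to $\ln \frac{1}{\min_h p_0[h]} + 1$, which is the value of $\alpha(p_0)$ given above for generalized binary search.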

References

Cuong, N. V.; Lee, W. S.; Ye, N.; Chai, K. M. A.; and Chieu, H. L. 2013. Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. In NIPS.

Cuong, N. V.; Lee, W. S.; and Ye, N. 2014. Near-optimal adaptive pool-based active learning with general loss. In UAI.
