Asynchronous Parallel Bayesian Optimisation via Thompson Sampling
Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos
AutoML Workshop, ICML, Sydney, Aug 2017

Summary:

Parallelised Thompson Sampling for BO

• Setting: Bayesian optimisation with parallel evaluations.
• A direct application of Thompson sampling in synchronous and asynchronous parallel settings does essentially as well as if the evaluations were made in sequence.
• When evaluation time is factored in, the asynchronous version outperforms the synchronous and sequential versions.
• The proposed methods are conceptually and computationally much simpler than existing methods for parallel BO.

A straightforward application of TS to parallel settings: synchronous (synTS) and asynchronous (asyTS).

synTS:
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
for j = 1, 2, . . .
  1. Wait for all workers to finish.
  2. D_j ← D_{j−1} ∪ {(x_m, y_m)}_{m=1}^{M}, where (x_m, y_m) is worker m's query/observation.
  3. Compute the posterior GP with D_j.
  4. Draw one sample g_m ∼ GP for each worker m.
  5. Deploy worker m at argmax g_m, for each m.
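The synchronous round above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the RBF kernel, the grid of candidate points, and the toy observations are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_grid = np.linspace(0, 1, 100)                  # candidate points (assumed grid)

def rbf(A, B, ls=0.1):
    """Squared-exponential kernel kappa(x, x') on 1-D inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

def syn_ts_round(X_obs, y_obs, M):
    """One synTS round: GP posterior from all data gathered so far, then M
    independent posterior samples; worker m is sent to the argmax of sample m."""
    K = rbf(X_obs, X_obs) + 1e-4 * np.eye(len(X_obs))       # noisy Gram matrix
    sol = np.linalg.solve(K, rbf(X_obs, X_grid))
    mu = sol.T @ y_obs                                      # posterior mean
    cov = rbf(X_grid, X_grid) - rbf(X_obs, X_grid).T @ sol  # posterior covariance
    G = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_grid)), size=M)
    return X_grid[np.argmax(G, axis=1)]                     # one query per worker

# Toy usage: 3 past observations, deploy 4 workers in parallel.
X_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.sin(3 * X_obs)
queries = syn_ts_round(X_obs, y_obs, M=4)
print(queries.shape)   # (4,)
```

Because the M posterior samples are drawn independently, the workers naturally spread out without any explicit diversity penalty.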

Gaussian process (Bayesian) optimisation of an expensive blackbox function. Examples:
- hyper-parameter tuning
- ML estimation in astrophysics
- optimal policy in autonomous driving

Main theoretical results: Let f ∼ GP(0, κ). Then for seqTS, synTS, and asyTS, after n evaluations we have E[SR(n)] ≲ √(Ψ_n log(n)/n), where Ψ_n is the maximum information gain.

Let the time taken for an evaluation be random. Then the expected numbers of completed evaluations for seqTS, synTS, and asyTS satisfy n_seq < n_syn < n_asy.

Minimise the simple regret: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).

Therefore, asyTS achieves asymptotically better regret with time, E[SR′(T)], than synTS and seqTS.

Bayesian optimisation via Thompson sampling: model f ∼ GP(0, κ). At time t, sample g from the posterior GP, choose x_t = argmax_{x∈X} g(x), and set y_t ← evaluate f at x_t.
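A single selection step translates directly, assuming the posterior mean `mu` and covariance `cov` on a candidate grid are already available (the placeholder values below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X_grid = np.linspace(0, 1, 100)

# Placeholder posterior (illustrative): any valid mean/covariance on the grid works.
mu = np.sin(3 * X_grid)
cov = 0.05 * np.exp(-(X_grid[:, None] - X_grid[None, :]) ** 2 / 0.02)

# Thompson-sampling step: one draw g ~ posterior, then x_t = argmax g.
g = rng.multivariate_normal(mu, cov + 1e-9 * np.eye(len(X_grid)))
x_t = X_grid[np.argmax(g)]
print(x_t)
```

The only randomness in the acquisition is the single posterior draw, which is what makes the parallel extensions above so simple.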

N = number of completed evaluations; in parallel settings, this counts evaluations completed across all M workers. SR′(T) is practically more relevant than SR(n) and leads to new results in parallel BO.

Expected number of completed evaluations within time T for different completion-time distributions:

Distribution | pdf p(x)                            | seqTS                 | synTS                     | asyTS
Unif(a, b)   | 1/(b−a) for x ∈ (a, b)              | n_seq = 2T/(b+a)      | n_syn = MT(M+1)/(a+bM)    | n_asy = M·n_seq
HN(ζ²)       | (√2/(ζ√π)) e^(−x²/(2ζ²)) for x > 0  | n_seq = T√π/(ζ√2)     | n_syn ≍ M·n_seq/√log(M)   | n_asy = M·n_seq
Exp(λ)       | λe^(−λx) for x > 0                  | n_seq = λT            | n_syn ≍ M·n_seq/log(M)    | n_asy = M·n_seq

The difference between the synchronous and asynchronous bounds grows with M and is pronounced for heavy-tailed distributions.
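The ordering n_seq < n_syn < n_asy can be checked with a quick simulation; the Unif(a, b) completion times, the budget T, and the worker count M below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, trials = 10, 200.0, 100
draw = lambda size: rng.uniform(0.5, 1.5, size)   # Unif(a, b) completion times

def n_seq():
    """Sequential: one worker evaluates until the time budget T runs out."""
    t, n = 0.0, 0
    while True:
        t += draw(1)[0]
        if t > T:
            return n
        n += 1

def n_syn():
    """Synchronous: every batch waits for the slowest of the M workers."""
    t, n = 0.0, 0
    while True:
        t += draw(M).max()
        if t > T:
            return n
        n += M

def n_asy():
    """Asynchronous: M workers run independently, never waiting on each other."""
    return sum(n_seq() for _ in range(M))

seq = np.mean([n_seq() for _ in range(trials)])
syn = np.mean([n_syn() for _ in range(trials)])
asy = np.mean([n_asy() for _ in range(trials)])
print(seq < syn < asy)   # True
```

The simulated averages also match the table: asy ≈ M·seq, while syn loses a factor that grows with the expected maximum of M completion times.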

This work: parallel evaluations with M workers. Several methods exist for this setting, but they either
▶ cannot handle asynchronicity,
▶ do not come with theoretical guarantees, or
▶ are conceptually/computationally complex.

Experiments: simple regret SR′(T) against time T. Problems: Hartmann18 (d = 18, M = 25, halfnormal); Park2 (d = 4, M = 10, halfnormal); CurrinExp-14 (d = 14, M = 35, pareto(k=3)). Methods compared: synRAND, synBUCB, synUCBPE, synTS, asyRAND, asyUCB, asyEI, asyHUCB, asyHTS, asyTS.

Simple regret on a time budget: after time T,
SR′(T) = f(x⋆) − max_{j≤N} f(x_j) if N ≥ 1, and SR′(T) = max_{x∈X} |f(x⋆) − f(x)| otherwise.

[Figure: SR′(T) vs. time units T on the three problems above.]
Table: n_seq, n_syn, n_asy for different random completion time models.
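The piecewise definition of SR′(T) translates directly into code; `f_star`, `y_completed`, and `worst_gap` (the max_x |f(x⋆) − f(x)| fallback for the no-completed-evaluation case) are hypothetical names for this sketch.

```python
def sr_prime(f_star, y_completed, worst_gap):
    """SR'(T): regret of the best evaluation completed by time T.
    If nothing has finished (N = 0), charge the worst-case gap
    max_x |f(x*) - f(x)| instead."""
    if not y_completed:          # N = 0: no evaluation finished by time T
        return worst_gap
    return f_star - max(y_completed)

print(sr_prime(1.0, [0.2, 0.6], 2.0))  # 0.4
print(sr_prime(1.0, [], 2.0))          # 2.0
```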

f : X ≡ [0, 1]^d → R is a noisy, expensive, black-box function. x⋆ = argmax_{x∈X} f(x) is a maximiser.


asyTS:
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
for j = 1, 2, . . .
  1. Wait for any one worker to finish.
  2. D_j ← D_{j−1} ∪ {(x′, y′)}, where (x′, y′) is that worker's query/observation.
  3. Compute the posterior GP with D_j.
  4. Draw a single sample g ∼ GP.
  5. Redeploy the worker at argmax g.
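The asynchronous loop can be sketched with an event queue of worker finish times: pop the earliest finisher, fold its result into the data, and redeploy that worker immediately. The toy blackbox, exponential completion times, and RBF kernel are all assumptions for illustration, not the paper's setup.

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2            # toy blackbox (illustrative)
M, budget = 4, 3.0
X_grid = np.linspace(0, 1, 50)

def thompson_point(X_obs, y_obs):
    """Posterior of a zero-mean GP with an RBF kernel, then one TS draw."""
    if not X_obs:
        return rng.choice(X_grid)
    Xo, yo = np.array(X_obs), np.array(y_obs)
    k = lambda A, B: np.exp(-(A[:, None] - B[None, :]) ** 2 / 0.02)
    K = k(Xo, Xo) + 1e-4 * np.eye(len(Xo))
    sol = np.linalg.solve(K, k(Xo, X_grid))
    mu, cov = sol.T @ yo, k(X_grid, X_grid) - k(Xo, X_grid).T @ sol
    g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_grid)))
    return X_grid[np.argmax(g)]

# Event queue of (finish_time, worker, x). Only the one finished worker is
# updated and redeployed; the others keep running undisturbed.
events, D_X, D_y = [], [], []
for m in range(M):
    heapq.heappush(events, (rng.exponential(0.5), m, rng.choice(X_grid)))
while events:
    t, m, x = heapq.heappop(events)
    if t > budget:
        break
    D_X.append(x); D_y.append(f(x))
    x_new = thompson_point(D_X, D_y)
    heapq.heappush(events, (t + rng.exponential(0.5), m, x_new))
print(len(D_y), "completed evaluations")
```

The contrast with synTS is the single-point update in the loop body: no barrier, so no worker ever idles waiting for a slow evaluation.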


See the paper for more synthetic and real experiments.

Selected references:
• Russo, D., et al., 2014. Learning to Optimize via Posterior Sampling.
• Srinivas, N., et al., 2010. Gaussian Process Optimization in the Bandit Setting ….

