Parallelised Thompson Sampling for BO

• Setting: Bayesian optimisation with parallel evaluations.
• A direct application of Thompson sampling in synchronous and asynchronous parallel settings does essentially as well as if the evaluations were made in sequence.
• When evaluation time is factored in, the asynchronous version outperforms the synchronous and sequential versions.
• The proposed methods are conceptually and computationally much simpler than existing methods for parallel BO.

A straightforward application of TS to parallel settings: synchronous (synTS) and asynchronous (asyTS).

synTS (synchronous):
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
for j = 1, 2, . . .
  1. Wait for all M workers to finish.
  2. D_j ← D_{j−1} ∪ {(x_m, y_m)}_{m=1}^{M}, where (x_m, y_m) is worker m's query/observation.
  3. Compute the posterior GP_j with D_j.
  4. Draw M samples g_m ∼ GP_j, one per worker.
  5. Deploy worker m at argmax_x g_m(x), for each m.
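The batch step (4-5) can be sketched with a NumPy-based GP posterior on a finite grid; the RBF kernel, grid, and toy data below are illustrative assumptions, not part of the original description.

```python
import numpy as np

def rbf(a, b, lengthscale=0.2):
    """Squared-exponential kernel; the lengthscale is an arbitrary choice."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def syn_ts_batch(X_obs, y_obs, X_grid, M, rng, noise=1e-4):
    """One synTS round: draw M independent posterior samples g_m and
    deploy worker m at the maximiser of g_m."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, X_grid)
    mu = Ks.T @ np.linalg.solve(K, y_obs)
    cov = rbf(X_grid, X_grid) - Ks.T @ np.linalg.solve(K, Ks)
    g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_grid)), size=M)
    return X_grid[np.argmax(g, axis=1)]   # one query per worker

rng = np.random.default_rng(1)
X_grid = np.linspace(0.0, 1.0, 101)
X_obs = np.array([0.1, 0.5, 0.9])         # toy data already gathered
y_obs = -(X_obs - 0.3) ** 2               # toy blackbox, optimum near 0.3
batch = syn_ts_batch(X_obs, y_obs, X_grid, M=4, rng=rng)
print(batch.shape)  # (4,)
```

Because each g_m is an independent posterior draw, the batch is diverse without any explicit diversity penalty, which is the source of synTS's simplicity.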

Gaussian Process (Bayesian) Optimisation. Expensive blackbox function examples:
- Hyper-parameter tuning
- ML estimation in astrophysics
- Optimal policy in autonomous driving

Main Theoretical Result: Let f ∼ GP(0, κ). Then for seqTS, synTS, and asyTS, after n evaluations we have E[SR(n)] ≲ √(log(n) Ψ_n / n), where Ψ_n is the maximum information gain of the kernel κ.
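For concreteness, and assuming the standard maximum-information-gain bounds of Srinivas et al. 2010 (cited below), the squared-exponential kernel instantiates the bound as:

```latex
% \Psi_n is the maximum information gain of \kappa after n evaluations.
% For the squared-exponential kernel on [0,1]^d, \Psi_n \asymp (\log n)^{d+1}, so
\mathbb{E}[\mathrm{SR}(n)] \;\lesssim\; \sqrt{\frac{\log(n)\,\Psi_n}{n}}
\;\asymp\; \sqrt{\frac{(\log n)^{d+2}}{n}} \;\longrightarrow\; 0
\quad \text{as } n \to \infty .
```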

Let the time taken for a single evaluation be random. Then the expected numbers of completed evaluations for seqTS, synTS, and asyTS satisfy n_seq < n_syn < n_asy.

Goal: minimise the simple regret SR(n) = f(x⋆) − max_{t=1,...,n} f(x_t).

Therefore, asyTS achieves asymptotically better regret with time, E[SR′(T)], than synTS and seqTS.

Bayesian Optimisation via Thompson Sampling:
Model f ∼ GP(0, κ). At each time t:
1. Sample g from the posterior GP.
2. Choose x_t = argmax_{x∈X} g(x).
3. y_t ← evaluate f at x_t.
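As a concrete illustration (a minimal sketch, not the paper's implementation), here is sequential GP Thompson sampling on a 1-D grid; the RBF kernel, grid resolution, and toy objective are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel kappa; lengthscale is an arbitrary choice."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)

def posterior(X_obs, y_obs, X_grid, noise=1e-4):
    """Mean and covariance of the GP(0, kappa) posterior on a finite grid."""
    K = rbf_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf_kernel(X_obs, X_grid)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y_obs
    cov = rbf_kernel(X_grid, X_grid) - Ks.T @ sol
    return mu, cov

def seq_ts(f, X_grid, n_iters, rng):
    """seqTS: at each step, sample g from the posterior and query argmax g."""
    X_obs = np.array([rng.choice(X_grid)])        # random first query
    y_obs = np.array([f(X_obs[0])])
    for _ in range(n_iters - 1):
        mu, cov = posterior(X_obs, y_obs, X_grid)
        g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(X_grid)))
        x_next = X_grid[np.argmax(g)]             # maximiser of the sample
        X_obs = np.append(X_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    return X_obs, y_obs

rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2                     # toy blackbox, optimum at 0.3
X_grid = np.linspace(0.0, 1.0, 101)
X_obs, y_obs = seq_ts(f, X_grid, 20, rng)
print(X_obs[np.argmax(y_obs)])                    # best query found so far
```

Sampling g on a discrete grid sidesteps the need for a continuous posterior sample; for higher dimensions one would replace the grid with random Fourier features or an inner optimiser.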

N ← # of completed evaluations; in parallel settings, this counts evaluations completed by all M workers. SR′(T) is practically more relevant than SR(n) and leads to new results in parallel BO.

n_seq, n_syn, n_asy for different random completion-time models:

| Distribution | pdf p(x)                            | seqTS              | synTS                     | asyTS           |
| Unif(a, b)   | 1/(b−a) for x ∈ (a, b)              | n_seq = 2T/(a+b)   | n_syn = T·M(M+1)/(a+bM)   | n_asy = M·n_seq |
| HN(ζ²)       | (√2/(ζ√π))·e^(−x²/(2ζ²)) for x > 0  | n_seq = T√π/(ζ√2)  | n_syn ≍ M·n_seq/√log(M)   | n_asy = M·n_seq |
| Exp(λ)       | λe^(−λx) for x > 0                  | n_seq = λT         | n_syn ≍ M·n_seq/log(M)    | n_asy = M·n_seq |

The difference between the synchronous and asynchronous bounds grows with M and is most pronounced for heavy-tailed distributions.
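The ordering n_seq < n_syn < n_asy can be checked with a quick Monte Carlo simulation. This is a sketch using half-normal HN(1) completion times; the budget T, worker count M, and number of runs are arbitrary choices for the example.

```python
import numpy as np

def completed(draw, T, M, mode, rng):
    """Count evaluations finished within time budget T for one simulated run."""
    if mode == "seq":            # one worker, evaluations back-to-back
        t = n = 0
        while True:
            t += draw(rng)
            if t > T:
                return n
            n += 1
    if mode == "syn":            # batches of M; each batch waits for the slowest
        t = n = 0
        while True:
            t += draw(rng, M).max()
            if t > T:
                return n
            n += M
    if mode == "asy":            # M independent workers, no waiting
        return sum(completed(draw, T, M, "seq", rng) for _ in range(M))

rng = np.random.default_rng(0)
halfnormal = lambda rng, size=None: np.abs(rng.normal(0.0, 1.0, size))
T, M, runs = 100.0, 10, 200
est = {m: np.mean([completed(halfnormal, T, M, m, rng) for _ in range(runs)])
       for m in ("seq", "syn", "asy")}
print(est)  # expect est["seq"] < est["syn"] < est["asy"]
```

The synchronous penalty is exactly the expected maximum of M completion times per batch, which is why the gap widens for heavy-tailed models.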

This work: parallel evaluations with M workers. Several methods exist for this setting, but they either:
▶ cannot handle asynchronicity,
▶ do not come with theoretical guarantees, or
▶ are conceptually/computationally complex.

Experiments: simple regret SR′(T) vs. time.
Panels: Hartmann18 (d = 18, M = 25, halfnormal); Park2 (d = 4, M = 10, halfnormal).
Methods compared: synRAND, synBUCB, synUCBPE, synTS, asyRAND, asyUCB, asyEI, asyHUCB, asyHTS, asyTS.

Simple regret on a time budget: after time T,
SR′(T) = f(x⋆) − max_{j≤N} f(x_j)   if N ≥ 1,
SR′(T) = max_{x∈X} |f(x⋆) − f(x)|   otherwise.
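In code, with illustrative numbers (the worst-case fallback max_x |f(x⋆) − f(x)| is passed in precomputed):

```python
def simple_regret_time(f_star, y_completed, worst_case):
    """SR'(T): regret over the N evaluations completed within the time
    budget; if none finished (N = 0), fall back to the worst-case regret
    max_x |f(x*) - f(x)|."""
    if len(y_completed) >= 1:
        return f_star - max(y_completed)
    return worst_case

print(simple_regret_time(1.0, [0.25, 0.75], worst_case=2.0))  # 0.25
print(simple_regret_time(1.0, [], worst_case=2.0))            # 2.0
```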


Third experiment panel: CurrinExp-14 (d = 14, M = 35, pareto(k = 3)), also plotting SR′(T) vs. time.



f : X ≡ [0, 1]^d → R is a noisy, expensive, black-box function; x⋆ = argmax_{x∈X} f(x) is a maximiser.

asyTS (asynchronous):
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
for j = 1, 2, . . .
  1. Wait for any one worker to finish.
  2. D_j ← D_{j−1} ∪ {(x′, y′)}, where (x′, y′) is that worker's previous query/observation.
  3. Compute the posterior GP_j with D_j.
  4. Draw a single sample g ∼ GP_j.
  5. Redeploy the worker at argmax_x g(x).
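The "wait for any one worker" bookkeeping can be sketched as a simulated event loop keyed on finish times. Posterior sampling is abstracted behind `next_query` here; the uniform stand-in acquisition and half-normal completion times are illustrative assumptions, not the paper's setup.

```python
import heapq
import numpy as np

def asy_loop(f, next_query, eval_time, T, M, rng):
    """asyTS event loop: whenever any one worker finishes, absorb its
    observation and immediately redeploy it (no waiting for the others)."""
    data = []                                  # (x, y) pairs gathered so far
    # deploy all M workers; the heap orders them by finish time
    busy = [(eval_time(rng), next_query(data, rng)) for _ in range(M)]
    heapq.heapify(busy)
    while busy:
        t, x = heapq.heappop(busy)             # step 1: earliest finisher
        if t > T:                              # time budget exhausted
            break
        data.append((x, f(x)))                 # step 2: record its observation
        x_new = next_query(data, rng)          # steps 3-5: resample, redeploy
        heapq.heappush(busy, (t + eval_time(rng), x_new))
    return data

rng = np.random.default_rng(0)
f = lambda x: -(x - 0.3) ** 2
uniform_query = lambda data, rng: rng.uniform(0.0, 1.0)  # stand-in for a TS draw
halfnormal_time = lambda rng: abs(rng.normal(0.0, 1.0))
data = asy_loop(f, uniform_query, halfnormal_time, T=20.0, M=4, rng=rng)
print(len(data))  # completed evaluations within the budget
```

Swapping `uniform_query` for a posterior-sample argmax (as in the seqTS sketch) turns this loop into asyTS proper.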

[x-axis of all experiment panels: Time units (T)]

See paper for more synthetic and real experiments.
Selected References:
• Russo, D. et al., 2014. Learning to Optimize via Posterior Sampling.
• Srinivas, N. et al., 2010. Gaussian Process Optimization in the Bandit Setting…
