Asynchronous Parallel Bayesian Optimisation via Thompson Sampling
Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos
AutoML Workshop, ICML, Sydney, Aug 2017

Summary:
Parallelised Thompson Sampling for BO
• Setting: Bayesian optimisation with parallel evaluations.
• A direct application of Thompson sampling in synchronous and asynchronous parallel settings does essentially as well as if the evaluations were made in sequence.
• When evaluation time is factored in, the asynchronous version outperforms the synchronous and sequential versions.
• Proposed methods are conceptually and computationally much simpler than existing methods for parallel BO.
A straightforward application of TS to parallel settings:
• Synchronous: synTS
• Asynchronous: asyTS
synTS (synchronous):
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
For j = 1, 2, . . .
1. Wait for all M workers to finish.
2. D_j ← D_{j−1} ∪ {(x_m, y_m)}_{m=1}^{M}, where (x_m, y_m) is worker m's query/observation.
3. Compute posterior GP_j with D_j.
4. Draw M samples g_m ∼ GP_j, for m = 1, . . . , M.
5. Deploy worker m at argmax g_m, for each m.
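Below is a minimal numpy sketch of one synTS round on a discretised domain. It makes assumptions the poster does not fix: a squared-exponential kernel with a fixed bandwidth, a small Gaussian-noise jitter, and illustrative helper names (sq_exp_kernel, gp_posterior, syn_ts_round).

```python
import numpy as np

def sq_exp_kernel(A, B, bandwidth=0.1):
    """Squared-exponential kernel matrix between point sets A (n,d) and B (m,d)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def gp_posterior(X, y, grid, noise=1e-3):
    """Mean and covariance on `grid` of the GP(0, kappa) posterior given (X, y)."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    K_star = sq_exp_kernel(X, grid)
    solved = np.linalg.solve(K, K_star)
    mu = solved.T @ y
    cov = sq_exp_kernel(grid, grid) - K_star.T @ solved
    return mu, cov

def syn_ts_round(X, y, grid, M, rng):
    """One synchronous round: draw M posterior samples, query each one's argmax."""
    mu, cov = gp_posterior(X, y, grid)
    jitter = 1e-8 * np.eye(len(grid))           # keeps sampling numerically stable
    g = rng.multivariate_normal(mu, cov + jitter, size=M)   # g_m ~ posterior GP
    return grid[g.argmax(axis=1)]               # worker m is deployed at argmax g_m

# Toy usage: 5 synchronous rounds with M = 4 workers on a 1-D toy function.
rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x[:, 0]) * x[:, 0]     # illustrative blackbox
grid = np.linspace(0, 1, 200)[:, None]
X = rng.uniform(size=(3, 1)); y = f(X)
for _ in range(5):
    X_next = syn_ts_round(X, y, grid, M=4, rng=rng)
    X, y = np.vstack([X, X_next]), np.concatenate([y, f(X_next)])
```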
Gaussian Process (Bayesian) Optimisation
Expensive blackbox function. Examples:
• Hyper-parameter tuning
• ML estimation in astrophysics
• Optimal policy in autonomous driving
Main Theoretical Results: Let f ∼ GP(0, κ). Then for seqTS, synTS, and asyTS, after n evaluations we have E[SR(n)] ≲ √(log(n) Ψ_n / n), where Ψ_n is the maximum information gain of the kernel κ.
[Figure: illustration of a black-box function f(x) over x, with the maximiser x⋆ and optimal value f(x⋆) marked.]
Let the time taken for an evaluation be random. Then the expected numbers of completed evaluations within a time budget T for seqTS, synTS, and asyTS satisfy n_seq < n_syn < n_asy.
Minimise Simple Regret: SR(n) = f(x⋆) − max_{t=1,…,n} f(x_t).
Therefore, asyTS achieves asymptotically better regret with time, E[SR′(T)], than synTS and seqTS.
Bayesian Optimisation via Thompson Sampling
Model f ∼ GP(0, κ). At time t:
1. Sample g from the posterior GP.
2. Choose x_t = argmax_{x∈X} g(x).
3. y_t ← evaluate f at x_t.
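A self-contained numpy sketch of a single Thompson-sampling step on a discretised X = [0, 1]; the squared-exponential kernel, bandwidth, and noise level are illustrative choices, not taken from the poster.

```python
import numpy as np

rng = np.random.default_rng(1)
kern = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * 0.1 ** 2))

grid = np.linspace(0, 1, 200)        # discretisation of X = [0, 1]
X = rng.uniform(size=5)              # queries made so far
y = np.sin(3 * X) * X                # their observations (toy blackbox)

# Posterior GP given (X, y), evaluated on the grid.
K = kern(X, X) + 1e-3 * np.eye(len(X))
solved = np.linalg.solve(K, kern(X, grid))
mu, cov = solved.T @ y, kern(grid, grid) - kern(X, grid).T @ solved

# Thompson step: sample g from the posterior, then query the sample's argmax.
g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))
x_t = grid[g.argmax()]
```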
N ← # of completed evaluations within time T; in parallel settings this counts evaluations completed by all M workers. SR′(T) is practically more relevant than SR(n) and leads to new results in parallel BO.

[Figure: evaluation schedules of seqTS, synTS, and asyTS with M workers.]

n_seq, n_syn, n_asy for different random completion time models:

Distribution | pdf p(x) | n_seq | n_syn | n_asy
Unif(a, b) | 1/(b−a) for x ∈ (a, b) | 2T/(a+b) | MT(M+1)/(a+bM) | M·n_seq
HN(ζ²) | (√2/(ζ√π))·e^(−x²/(2ζ²)) for x > 0 | T√π/(ζ√2) | ≍ M·n_seq/√log(M) | M·n_seq
Exp(λ) | λe^(−λx) for x > 0 | λT | ≍ M·n_seq/log(M) | M·n_seq
The difference between the synchronous and asynchronous bounds grows with M, and is pronounced for heavy-tailed distributions.
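A quick Monte Carlo check of this ordering under Exp(λ) completion times; the budget T, rate λ, M, and the helper name count_completions are all illustrative.

```python
import numpy as np

def count_completions(T, M, mode, rng, lam=1.0, n_max=10_000):
    """# of evaluations finished by time T with M workers, Exp(lam) durations."""
    draw = lambda shape: rng.exponential(1 / lam, shape)
    if mode == "seq":    # a single worker, evaluations back to back
        return int(np.searchsorted(np.cumsum(draw(n_max)), T))
    if mode == "syn":    # all M redeploy when the slowest finishes;
                         # count full rounds (slight undercount of the last one)
        round_lengths = draw((n_max, M)).max(axis=1)
        return M * int(np.searchsorted(np.cumsum(round_lengths), T))
    if mode == "asy":    # each worker redeployed immediately, independently
        return sum(int(np.searchsorted(np.cumsum(draw(n_max)), T))
                   for _ in range(M))

rng = np.random.default_rng(0)
T, M, trials = 100.0, 20, 200
for mode in ("seq", "syn", "asy"):
    n = np.mean([count_completions(T, M, mode, rng) for _ in range(trials)])
    print(mode, round(float(n), 1))   # observe n_seq < n_syn < n_asy
```

With λ = 1 this prints roughly n_seq ≈ λT and n_asy ≈ M·n_seq, with n_syn below n_asy by about a log(M) factor, consistent with the table above.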
This work: parallel evaluations with M workers.
Several methods exist for this setting, but they either
▶ cannot handle asynchronicity,
▶ do not come with theoretical guarantees, or
▶ are conceptually/computationally complex.
Experiments
[Figure: SR′(T) vs time units (T), three panels: Hartmann18 (d = 18, M = 25, half-normal completion times); Park2 (d = 4, M = 10, half-normal); CurrinExp-14 (d = 14, M = 35, Pareto(k = 3)). Methods compared: synRAND, synBUCB, synUCBPE, synTS, asyRAND, asyUCB, asyEI, asyHUCB, asyHTS, asyTS.]
Simple regret on a time budget: after time T,
SR′(T) = f(x⋆) − max_{j≤N} f(x_j)   if N ≥ 1,
SR′(T) = max_{x∈X} |f(x⋆) − f(x)|   otherwise.
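A small helper computing SR′(T) from time-stamped results — a sketch in which the caller supplies the N = 0 fallback penalty, since max_{x∈X} |f(x⋆) − f(x)| is generally unknown for a black-box f.

```python
def time_regret(results, T, f_opt, penalty):
    """SR'(T): regret of the best evaluation finished by time T.

    results: list of (finish_time, value) pairs; f_opt is f(x*); penalty is a
    stand-in for max_x |f(x*) - f(x)|, used when no evaluation has finished.
    """
    done = [value for finish_time, value in results if finish_time <= T]
    return f_opt - max(done) if done else penalty
```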
f : X ≡ [0, 1]^d → R is a noisy, expensive, black-box function. x⋆ = argmax_{x∈X} f(x) is a maximiser.
asyTS (asynchronous):
Input: prior GP(0, κ). D_1 ← ∅, GP_1 ← GP(0, κ).
For j = 1, 2, . . .
1. Wait for any one worker to finish.
2. D_j ← D_{j−1} ∪ {(x′, y′)}, where (x′, y′) is that worker's previous query/observation.
3. Compute posterior GP_j with D_j.
4. Draw a sample g ∼ GP_j.
5. Re-deploy the worker at argmax g.
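A sketch of the asyTS event loop, simulating M workers with a priority queue keyed by finish time. Durations are drawn Exp(1) purely for illustration, and propose repeats the posterior-sampling step from the earlier sketches.

```python
import heapq
import numpy as np

def propose(X, y, grid, rng):
    """Thompson step: draw g ~ GP posterior on the grid, return its argmax."""
    kern = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * 0.1 ** 2))
    K = kern(X, X) + 1e-3 * np.eye(len(X))
    solved = np.linalg.solve(K, kern(X, grid))
    g = rng.multivariate_normal(
        solved.T @ y,
        kern(grid, grid) - kern(X, grid).T @ solved + 1e-8 * np.eye(len(grid)))
    return grid[g.argmax()]

def asy_ts(f, T, M, rng):
    """asyTS: whenever any one worker finishes, update the data, redeploy it."""
    grid = np.linspace(0, 1, 200)
    # Seed every worker with a random query; events are (finish, worker, point).
    events = [(rng.exponential(), m, rng.uniform()) for m in range(M)]
    heapq.heapify(events)
    X, y = np.empty(0), np.empty(0)
    while True:
        now, m, x = heapq.heappop(events)     # step 1: wait for *a* worker
        if now > T:
            break                             # time budget exhausted
        X, y = np.append(X, x), np.append(y, f(x))    # step 2: fold in (x', y')
        x_next = propose(X, y, grid, rng)     # steps 3-4: posterior + sample
        heapq.heappush(events, (now + rng.exponential(), m, x_next))   # step 5
    return X, y

# Toy usage on a 1-D function with M = 5 simulated workers.
rng = np.random.default_rng(2)
X, y = asy_ts(lambda x: np.sin(3 * x) * x, T=30.0, M=5, rng=rng)
print(len(y), "evaluations completed; best value:", y.max())
```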
See paper for more synthetic and real experiments.
Selected References:
• Russo, D., et al., 2014. Learning to Optimize via Posterior Sampling.
• Srinivas, N., et al., 2010. Gaussian Process Optimization in the Bandit Setting ….