Adaptive Consensus ADMM for Distributed Optimization

Zheng Xu¹, Gavin Taylor², Hao Li¹, Mário Figueiredo³, Xiaoming Yuan⁴, and Tom Goldstein¹

Abstract
Ø Study the alternating direction method of multipliers (ADMM) for distributed model-fitting problems.
Ø Boost ADMM performance by using different fine-tuned algorithm parameters on each worker node.
Ø Automatically tune the parameters without user oversight by assuming Barzilai-Borwein-style gradients.
Ø Present an O(1/k) convergence rate for adaptive ADMM methods with node-specific parameters.

ADMM and diagonal penalty
Ø Constrained problem in general form:
$$\min_{u,v}\; f(u) + g(v) \quad \text{subject to} \quad Au + Bv = b$$
Ø Saddle point problem by augmented Lagrangian:
$$\max_{\lambda}\,\min_{u,v}\; f(u) + g(v) + \langle \lambda,\, b - Au - Bv\rangle + \tfrac{1}{2}\|b - Au - Bv\|_T^2$$
Ø Alternating direction method of multipliers (ADMM):
$$u^{k+1} = \arg\min_u\; f(u) + \langle \lambda^k, -Au\rangle + \tfrac{1}{2}\|b - Au - Bv^k\|_{T^k}^2$$
$$v^{k+1} = \arg\min_v\; g(v) + \langle \lambda^k, -Bv\rangle + \tfrac{1}{2}\|b - Au^{k+1} - Bv\|_{T^k}^2$$
$$\lambda^{k+1} = \lambda^k + T^k\,(b - Au^{k+1} - Bv^{k+1})$$
where $T = \mathrm{diag}(\tau_1 I_d, \ldots, \tau_N I_d)$ is a diagonal penalty matrix and $\|x\|_T^2 = x^\top T x$.

Consensus problem
Ø Objective:
$$\min_{u_i,\,v}\; \sum_{i=1}^N f_i(u_i) + g(v) \quad \text{subject to} \quad u_i = v, \;\; \forall i = 1, \ldots, N$$
This is the general form with $u = (u_1; \ldots; u_N)$, $A = I_{dN}$, $B = -(I_d; \ldots; I_d)$, $b = 0$, and $f(u) = \sum_{i=1}^N f_i(u_i)$.
Ø Consensus ADMM (a minimal code sketch follows below):
$$u_i^{k+1} = \arg\min_{u_i}\; f_i(u_i) + \langle \lambda_i^k,\, v^k - u_i\rangle + \tfrac{\tau_i^k}{2}\|v^k - u_i\|^2$$
$$v^{k+1} = \arg\min_v\; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k,\, v - u_i^{k+1}\rangle + \tfrac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \Big)$$
$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$
The $u_i$-updates run on local worker nodes, which store $u_i$ and $f_i(u_i)$; the $v$-update runs on the central server, which stores $v$ and $g(v)$.
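To make the division of labor concrete, here is a minimal Python sketch of this loop with fixed node-specific penalties $\tau_i$ (the adaptive update is described later). The names consensus_admm, prox_fs, and prox_g are ours; each prox_fs[i] stands in for whatever local solver node i uses:

```python
import numpy as np

def consensus_admm(prox_fs, prox_g, d, iters=100, tau0=1.0):
    """Consensus ADMM with node-specific (here fixed) penalties tau_i.

    prox_fs[i](z, t) returns argmin_u f_i(u) + (t/2)*||u - z||^2  (local solver)
    prox_g(z, t)     returns argmin_v g(v)  + (t/2)*||v - z||^2   (server solver)
    """
    N = len(prox_fs)
    tau = np.full(N, tau0)                  # one penalty parameter per node
    u, lam = np.zeros((N, d)), np.zeros((N, d))
    v = np.zeros(d)
    for k in range(iters):
        # local u_i-updates (embarrassingly parallel across worker nodes):
        # argmin f_i(u) + <lam_i, v - u> + (tau_i/2)||v - u||^2
        #   = prox of f_i evaluated at v + lam_i/tau_i
        for i in range(N):
            u[i] = prox_fs[i](v + lam[i] / tau[i], tau[i])
        # central v-update: prox of g at the tau-weighted average
        z = (tau[:, None] * u - lam).sum(axis=0) / tau.sum()
        v = prox_g(z, tau.sum())
        # dual updates
        for i in range(N):
            lam[i] += tau[i] * (v - u[i])
    return u, v
```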
Ø Classification/regression problems:
• Example (EN regression): $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|^2$, $g(v) = \rho_1 |v| + \tfrac{\rho_2}{2}\|v\|^2$; for this choice the $v$-update has a closed form (see the sketch below).
• Others: sparse logistic regression, support vector machines (SVMs), semidefinite programming (SDP).
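For the elastic net regularizer above, the server step reduces to soft-thresholding. A minimal sketch, with the derivation in the docstring (the helper names are ours):

```python
import numpy as np

def soft_threshold(z, kappa):
    """Elementwise argmin_v kappa*|v|_1 + (1/2)*||v - z||^2."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def elastic_net_v_update(u, lam, tau, rho1, rho2):
    """Closed-form v-update for g(v) = rho1*|v|_1 + (rho2/2)*||v||^2.

    Setting the subgradient of the v-subproblem to zero gives
      (rho2 + sum_i tau_i) * v + rho1 * sign(v) = sum_i (tau_i*u_i - lam_i),
    which is solved by soft-thresholding.
    """
    t = rho2 + tau.sum()
    z = (tau[:, None] * u - lam).sum(axis=0)
    return soft_threshold(z / t, rho1 / t)
```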
Background: spectral stepsize
Ø Gradient descent: $x^{k+1} = x^k - \tau_k \nabla F(x^k)$
Ø Spectral (Barzilai-Borwein) stepsize: $\tau_k = 1/\alpha$, where $\alpha$ solves
$$\nabla F(x^k) - \nabla F(x^{k-1}) = \alpha\,(x^k - x^{k-1})$$
in the least-squares sense, i.e., assuming a locally linear gradient $\nabla F(x) = \alpha x + a$ (a code sketch follows below).
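In code, the least-squares slope estimate is a single inner-product ratio; a sketch of the classic BB1 variant (the function name is ours):

```python
import numpy as np

def bb_stepsize(x_prev, x_curr, g_prev, g_curr):
    """Spectral (Barzilai-Borwein) stepsize tau_k = 1/alpha, where alpha is
    the least-squares solution of g_curr - g_prev = alpha*(x_curr - x_prev)."""
    dx = x_curr - x_prev
    dg = g_curr - g_prev
    alpha = np.dot(dx, dg) / np.dot(dx, dx)   # slope of the secant model
    return 1.0 / alpha
```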
Adaptive Consensus ADMM
Ø Dual interpretation: ADMM is equivalent to Douglas-Rachford splitting applied to the dual problem
$$\min_{\lambda}\; \underbrace{f^*(A^\top \lambda) - \langle \lambda, b\rangle}_{\hat f(\lambda)} \;+\; \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$
Ø Linear assumption (the spectral-stepsize model applied to the dual): $\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + a$ and $\partial \hat g(\lambda) = M_\beta \lambda + b$, where $M_\alpha$, $M_\beta$ are diagonal matrices.
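For completeness, the dual form follows from the saddle-point formulation by a standard conjugate-function computation:
$$\begin{aligned}
\max_{\lambda}\,\min_{u,v}\; & f(u) + g(v) + \langle \lambda,\, b - Au - Bv\rangle \\
&= \max_{\lambda}\; \langle \lambda, b\rangle - \max_u \big( \langle A^\top\lambda, u\rangle - f(u) \big) - \max_v \big( \langle B^\top\lambda, v\rangle - g(v) \big) \\
&= \max_{\lambda}\; \langle \lambda, b\rangle - f^*(A^\top\lambda) - g^*(B^\top\lambda),
\end{aligned}$$
and negating the objective turns this into the minimization above.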
Ø Adaptive rule: split the equations relating $T$, $M_\alpha$, and $M_\beta$ into blocks, and apply the spectral penalty proposition (Xu et al., AISTATS 2017) to each block:
$$\hat\tau_i^k = 1/\sqrt{\hat\alpha_i^k \hat\beta_i^k}$$
Ø Curvature estimation and safeguarding the linear assumption: the node-wise curvatures $\hat\alpha_i^k$, $\hat\beta_i^k$ are estimated in Barzilai-Borwein fashion from the change in the iterates between iteration $k$ and an earlier iteration $k_0$, and the spectral estimate is accepted only when the correlation criteria $\alpha_{\mathrm{cor},i}^k$, $\beta_{\mathrm{cor},i}^k$ indicate that the linear model fits well.
Ø Safeguarding convergence (bounded adaptivity; see the sketch after this list):
$$\tau_i^{k+1} = \max\left\{ \min\left\{ \hat\tau_i^{k+1},\; \Big(1 + \frac{C_{\mathrm{cg}}}{k^2}\Big)\tau_i^k \right\},\; \frac{\tau_i^k}{1 + C_{\mathrm{cg}}/k^2} \right\}$$
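A compact sketch of the safeguarded node-wise update; the acceptance test follows the description above, and the default constants are illustrative rather than the paper's tuned values:

```python
def update_penalty(tau, tau_spectral, corr_alpha, corr_beta, k,
                   eps_cor=0.2, C_cg=1e10):
    """Safeguarded spectral penalty update for one node at iteration k.

    tau:          current penalty tau_i^k
    tau_spectral: spectral estimate 1/sqrt(alpha_hat_i * beta_hat_i)
    corr_alpha, corr_beta: correlation criteria for the linear assumption
    """
    # keep the current penalty if the linear model fits poorly
    tau_hat = tau_spectral if (corr_alpha > eps_cor and corr_beta > eps_cor) else tau
    # bounded adaptivity: tau_i may change by at most a factor 1 + C_cg/k^2
    bound = 1.0 + C_cg / k**2
    return max(min(tau_hat, bound * tau), tau / bound)
```

Since each step changes $\tau_i$ by at most a factor $1 + C_{\mathrm{cg}}/k^2$, we get $\eta^k \le C_{\mathrm{cg}}/k^2$, so the bounded-adaptivity condition below holds automatically.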
O(1/k) convergence with adaptivity
Ø Bounded adaptivity:
$$\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where} \quad \eta^k = \max_{i \in \{1,\ldots,N\}} \max\left\{ \frac{\tau_i^k}{\tau_i^{k-1}} - 1,\; \frac{\tau_i^{k-1}}{\tau_i^k} - 1 \right\}$$
Ø The norm of the residuals converges to zero.
Ø The worst-case ergodic O(1/k) convergence rate holds in the variational inequality sense.

Experiments
Ø Applications: EN regression (MNIST), sparse logistic regression (CIFAR10), SVM (MNIST), SDP (Ham-9-5-6).

Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+.

Dataset                  CADMM                 RB-ADMM             AADMM                CRB-ADMM             ACADMM (proposed)
                         (Boyd et al., 2011)   (He et al., 2000)   (Xu et al., 2017a)   (Song et al., 2016)
EN regression MNIST      100+ (1.49e4)         88 (1.29e3)         87 (1.27e4)          40 (5.99e3)          14 (2.18e3)
Sparse logreg CIFAR10    310 (700)             152 (402)           149 (368)            310 (727)            44 (118)
SVM MNIST                1000+ (930)           172 (287)           285 (340)            73 (127)             41 (88.0)
SDP Ham-9-5-6            100+ (2.01e3)         100+ (2.14e3)       100+ (2.14e3)        35 (860)             30 (703)
[Figures: relative residual vs. iterations and vs. runtime (seconds) on ENRegression-Synthetic2 and SVM-Synthetic2 for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM, and iteration counts as the initial penalty parameter, the number of cores, and the number of samples vary.]
Ø Fast convergence on synthetic and benchmark datasets; more results in the paper.
Ø Robust to the initial penalty parameter, the number of cores, and the amount of data.
Ø Acceleration by distributed computing