Adaptive Consensus ADMM for Distributed Optimization
Zheng Xu, with Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein

Outline
• Consensus problem in distributed computing
• Alternating direction method of multipliers (ADMM) and the penalty parameter
• Adaptive consensus ADMM (ACADMM) with spectral stepsize: a fully automated optimizer
• The $O(1/k)$ convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Statistical learning problem

$$\min_v \; f(v) + g(v)$$

• Example:

$$\min_x \; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

[Figure: the least-squares data model $c = Dx$, shown as matrix blocks labeled $c$, $D$, and $x$.]
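
As a concrete sketch, the elastic net objective above can be evaluated in a few lines of NumPy (the data `D`, `c` and the weights `rho1`, `rho2` below are made-up placeholders, not the talk's data):

```python
import numpy as np

def elastic_net_objective(x, D, c, rho1, rho2):
    """(1/2)||Dx - c||_2^2 + rho1*||x||_1 + (rho2/2)*||x||_2^2."""
    residual = D @ x - c
    return (0.5 * residual @ residual
            + rho1 * np.abs(x).sum()
            + 0.5 * rho2 * x @ x)

# Toy data (placeholders).
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
x = rng.standard_normal(10)
print(elastic_net_objective(x, D, c, rho1=0.1, rho2=0.1))
```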

Problem decomposition and data parallelism

$$\min_v \; \sum_{i=1}^N f_i(v) + g(v)$$

• Example:

$$\min_x \; \sum_{i=1}^N \tfrac{1}{2}\|D_i x - c_i\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

[Figure: the same system split by rows across $N$ nodes, with $c = [c_1; \dots; c_i; \dots; c_N]$ and $D = [D_1; \dots; D_i; \dots; D_N]$, so that $c = Dx$.]
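
The decomposition is just a row-block split of the least-squares term; a small NumPy sketch checking that the local losses sum to the global loss (synthetic placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                    # number of nodes (illustrative)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
x = rng.standard_normal(10)

# Partition rows: D = [D_1; ...; D_N], c = [c_1; ...; c_N].
D_blocks = np.array_split(D, N, axis=0)
c_blocks = np.array_split(c, N)

global_loss = 0.5 * np.sum((D @ x - c) ** 2)
local_losses = [0.5 * np.sum((Di @ x - ci) ** 2)
                for Di, ci in zip(D_blocks, c_blocks)]
assert np.isclose(global_loss, sum(local_losses))  # f(x) = sum_i f_i(x)
```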

Consensus problem

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Example:

[Figure: local nodes hold $f_1(u_1), \dots, f_i(u_i), \dots, f_N(u_N)$ with $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|_2^2$; a central server holds $v$ and $g(v) = \rho_1\|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2$.]

Consensus ADMM

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

$$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) - \langle \lambda_i^k, u_i \rangle + \tfrac{\tau_i^k}{2}\,\|v^k - u_i\|^2$$

$$v^{k+1} = \arg\min_{v} \; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k, v \rangle + \tfrac{\tau_i^k}{2}\,\|v - u_i^{k+1}\|^2 \Big)$$

$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$

Consensus ADMM and the penalty parameter

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

$$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) - \langle \lambda_i^k, u_i \rangle + \tfrac{\tau_i^k}{2}\,\|v^k - u_i\|^2$$

$$v^{k+1} = \arg\min_{v} \; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k, v \rangle + \tfrac{\tau_i^k}{2}\,\|v - u_i^{k+1}\|^2 \Big)$$

$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$

The penalty parameter $\tau_i^k$ is the only free parameter! (A sketch of the iteration follows.)
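
A minimal serial NumPy sketch of these three updates for the elastic net example (fixed penalties $\tau_i$, a ridge solve per node, soft-thresholding at the server; the data is a synthetic placeholder):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def consensus_admm(D_blocks, c_blocks, rho1, rho2, tau=1.0, iters=100):
    N, d = len(D_blocks), D_blocks[0].shape[1]
    u = np.zeros((N, d))
    lam = np.zeros((N, d))
    v = np.zeros(d)
    taus = np.full(N, tau)                 # fixed penalties (non-adaptive)
    for _ in range(iters):
        for i, (Di, ci) in enumerate(zip(D_blocks, c_blocks)):
            # u_i update: (D_i^T D_i + tau_i I) u_i = D_i^T c_i + lam_i + tau_i v
            A = Di.T @ Di + taus[i] * np.eye(d)
            u[i] = np.linalg.solve(A, Di.T @ ci + lam[i] + taus[i] * v)
        # v update: prox of rho1*||.||_1 + (rho2/2)*||.||^2
        s = rho2 + taus.sum()
        z = (taus[:, None] * u - lam).sum(axis=0)
        v = soft_threshold(z / s, rho1 / s)
        # dual update
        lam += taus[:, None] * (v - u)
    return v

rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((5, 10)) for _ in range(4)]
c_blocks = [rng.standard_normal(5) for _ in range(4)]
print(consensus_admm(D_blocks, c_blocks, rho1=0.1, rho2=0.1)[:3])
```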

Background: gradient descent
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$

Background: quadratic case
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• If $F$ is quadratic, $F(x) = \tfrac{\alpha}{2}\|x - x^*\|^2$
• Then the optimal stepsize is $\tau^k = 1/\alpha$, which reaches the minimizer $x^*$ in a single step

Background: spectral stepsize
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• Spectral (Barzilai-Borwein) stepsize:
  • Assume the function is locally quadratic with curvature $\alpha$
  • Estimate the curvature by a 1-D least squares fit $\nabla F(x) \approx \alpha x + a$
  • Run gradient descent with $\tau^k = 1/\alpha$ (sketched below)

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
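
A compact sketch of gradient descent with the Barzilai-Borwein stepsize on a toy quadratic; the 1-D least squares fit of the curvature reduces to a ratio of inner products:

```python
import numpy as np

def bb_gradient_descent(grad, x0, tau0=1e-3, iters=50):
    x = x0.copy()
    g = grad(x)
    tau = tau0                              # initial stepsize before any history
    for _ in range(iters):
        x_new = x - tau * g
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        denom = dx @ dx
        if denom == 0:                      # converged exactly
            return x_new
        # 1-D least squares fit grad(x) ~ alpha*x + a  =>  alpha = <dx,dg>/<dx,dx>
        alpha = (dx @ dg) / denom
        tau = 1.0 / alpha                   # spectral (BB) stepsize tau = 1/alpha
        x, g = x_new, g_new
    return x

# Toy quadratic F(x) = 0.5 x^T A x - b^T x, with grad F(x) = A x - b.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A.T @ A + np.eye(10)
b = rng.standard_normal(10)
x = bb_gradient_descent(lambda x: A @ x - b, np.zeros(10))
print(np.linalg.norm(A @ x - b))            # residual should be small
```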

Advantages of spectral stepsize
• Automates stepsize selection
• Achieves fast (superlinear) convergence

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
Y. Dai. A new analysis on the Barzilai-Borwein gradient method. 2013.

Spectral penalty of ADMM
Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Assume the function is locally quadratic
• Estimate the curvature(s)
• Use the estimated curvature(s) to set the penalty parameter

Dual interpretation
• Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v,$$

rewritten in matrix form as

$$\min_{u, v} \; f(u) + g(v), \quad \text{subject to } u + Bv = 0,$$

where $B = -(I_d; \dots; I_d)$ and $u = (u_1; \dots; u_N)$.
• Dual problem by Fenchel conjugates:

$$\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

No constraints! (Here $b = 0$ for the consensus constraint.)
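
For the missing step, the dual objective follows from one line of standard Lagrangian duality, using only the definition of the Fenchel conjugate:

$$\max_{\lambda}\,\min_{u,v}\; f(u) + g(v) + \langle \lambda,\, b - u - Bv\rangle \;=\; \max_{\lambda}\Big[\,\langle \lambda, b\rangle - \underbrace{\sup_{u}\big(\langle \lambda, u\rangle - f(u)\big)}_{f^*(\lambda)} - \underbrace{\sup_{v}\big(\langle B^\top\lambda,\, v\rangle - g(v)\big)}_{g^*(B^\top\lambda)}\Big]$$

Negating the bracketed term turns the maximization into $\min_\lambda f^*(\lambda) - \langle\lambda, b\rangle + g^*(B^\top\lambda)$, which is exactly $\hat f(\lambda) + \hat g(\lambda)$.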

Dual problem and DRS

$$\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

• ADMM on the primal variables $(u, v, \lambda)$ is equivalent to Douglas-Rachford splitting (DRS) on the dual pair $(\hat\lambda, \lambda)$ for $\hat f(\hat\lambda) + \hat g(\lambda)$, where

$$\hat\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^k - u_i^{k+1}).$$

Linear approximation
• The gradients are (locally) linear:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation
• The gradients are (locally) linear, with the stacked dual variables $\hat\lambda = [\hat\lambda_i]_{i \le N}$ and $\lambda = [\lambda_i]_{i \le N}$:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation
• The gradients are (locally) linear
• Node-specific penalty parameter: each node $i$ has its own curvatures $\alpha_i$ and $\beta_i$:

$$\partial \hat f(\hat\lambda) = [\alpha_i \cdot \hat\lambda_i + a_i]_{i \le N}, \qquad \partial \hat g(\lambda) = [\beta_i \cdot \lambda_i + b_i]_{i \le N}$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. 2016.

Linear approximation
• The gradients are (locally) linear:

$$\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + a \quad \text{and} \quad \partial \hat g(\lambda) = M_\beta \lambda + b,$$

where $M_\alpha$ and $M_\beta$ are diagonal matrices.
• The node-specific penalty parameters come from the diagonal curvature entries $\alpha_i$ and $\beta_i$ of $M_\alpha$ and $M_\beta$.
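
With $M_\alpha$ and $M_\beta$ diagonal, each curvature is a 1-D least squares fit, which could be estimated from two iterates of the dual variables; a sketch (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def curvature_estimate(d_lam, d_grad):
    """Fit d_grad ~ alpha * d_lam in the least squares sense.

    d_lam  : change in a dual variable between two iterates
    d_grad : corresponding change in the (sub)gradient of the dual function
    Returns the scalar alpha minimizing ||d_grad - alpha * d_lam||^2.
    """
    return (d_lam @ d_grad) / (d_lam @ d_lam)

def correlation(d_lam, d_grad):
    """Cosine similarity of the two changes; measures how well the
    local-linearity assumption holds, and is used to safeguard updates."""
    return (d_lam @ d_grad) / (np.linalg.norm(d_lam) * np.linalg.norm(d_grad))
```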

Node-specific spectral penalty
• Schema:

$$\tau_i^k = \frac{1}{\sqrt{\alpha_i \beta_i}}, \qquad \forall i = 1, \dots, N$$

• Estimation and safeguarding: the curvatures are estimated from the ADMM variables $(u, v, \lambda)$ via the changes

$$\Delta u_i^k = u_i^k - u_i^{k_0}, \quad \Delta\hat\lambda_i^k = \hat\lambda_i^k - \hat\lambda_i^{k_0}, \quad \Delta v^k = v^k - v^{k_0}, \quad \Delta\lambda_i^k = \lambda_i^k - \lambda_i^{k_0},$$

with the correlations $\alpha_{\mathrm{cor},i}^k$ and $\beta_{\mathrm{cor},i}^k$ used as safeguards (see the sketch below).
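
A hedged sketch of the resulting per-node rule: fit both curvatures, and accept the new penalty only when both correlation safeguards pass (the threshold 0.2 and the exact gradient-change proxies are assumptions of this sketch, not taken from the talk):

```python
import numpy as np

def spectral_penalty(tau_old, d_lam_hat, d_grad_f, d_lam, d_grad_g,
                     eps_cor=0.2):
    """Safeguarded node-specific spectral penalty: tau_i = 1/sqrt(alpha_i*beta_i).

    d_lam_hat, d_grad_f : changes in lam_hat_i and in the (sub)gradient of
                          f_hat there (proportional to u_i^k - u_i^{k0})
    d_lam, d_grad_g     : changes in lam_i and in the (sub)gradient of g_hat
                          there (proportional to v^k - v^{k0})
    eps_cor             : assumed correlation threshold for this sketch
    """
    def slope(dx, dg):                 # 1-D least squares fit  dg ~ alpha * dx
        return (dx @ dg) / (dx @ dx)

    def corr(dx, dg):                  # cosine of the angle between the changes
        return (dx @ dg) / (np.linalg.norm(dx) * np.linalg.norm(dg) + 1e-12)

    alpha = slope(d_lam_hat, d_grad_f)   # curvature of f_hat at node i
    beta = slope(d_lam, d_grad_g)        # curvature of g_hat seen by node i
    if corr(d_lam_hat, d_grad_f) > eps_cor and corr(d_lam, d_grad_g) > eps_cor:
        return 1.0 / np.sqrt(alpha * beta)
    return tau_old                       # fall back when the linear fit is poor
```

In the full method this would run periodically, using the stored iterates from steps $k$ and $k_0$.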

Convergence guarantee
• Assume the relative changes of the penalty are bounded (one simple way to enforce this is sketched below):

$$\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where } (\eta^k)^2 = \max_{i \in \{1,\dots,N\}} (\eta_i^k)^2, \qquad (\eta_i^k)^2 = \max\!\left\{ \tau_i^k/\tau_i^{k-1} - 1, \;\; \tau_i^{k-1}/\tau_i^k - 1 \right\}.$$

• Then the norm of the residuals converges to zero
• And the worst-case ergodic $O(1/k)$ convergence rate holds in the variational inequality sense
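
One simple way to meet this summability condition in code is to clip each relative change of $\tau_i$ by a schedule whose squares are summable, e.g. $(\eta^k)^2 \le (C/k)^2$; the constant $C$ and the $1/k$ schedule below are choices of this sketch, not prescribed by the talk:

```python
def bounded_penalty_update(tau_old, tau_proposed, k, C=100.0):
    """Clip tau so that max{tau/tau_old, tau_old/tau} - 1 <= (C/k)^2.

    Since sum_k (C/k)^2 is finite, the clipped sequence satisfies the
    bounded-adaptivity condition sum_k (eta^k)^2 < inf required above.
    """
    bound = 1.0 + (C / k) ** 2
    lo, hi = tau_old / bound, tau_old * bound
    return min(max(tau_proposed, lo), hi)
```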

Experiments
• ADMM methods:
  • Consensus ADMM [Boyd et al. 2011]
  • Residual balancing [He et al. 2000]
  • Consensus residual balancing [Song et al. 2016]
  • Adaptive ADMM [Xu et al. 2017]
• Applications:
  • Linear regression with elastic net regularizer
  • Sparse logistic regression
  • Support vector machine
  • Semidefinite programming

Residual plot
• Application: sparse logistic regression
• Dataset: News20, size 19,996 × 1,355,191
• Distributed on 128 cores

[Figure: relative residual (log scale, $10^{-5}$ to $10^1$) and penalty $\tau$ versus iterations (0-350).]

More numerical results
• More results in the paper!

Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+. Columns follow the method order listed on the Experiments slide, ending with the proposed ACADMM.

| Application | Dataset | CADMM | RB | CRB | AADMM | ACADMM |
|---|---|---|---|---|---|---|
| EN regression | MNIST | 100+ (1.49e4) | 88 (1.29e3) | 40 (5.99e3) | 87 (1.27e4) | 14 (2.18e3) |
| EN regression | News20 | 100+ (4.61e3) | 100+ (4.60e3) | 100+ (5.17e3) | 100+ (4.60e3) | 78 (3.54e3) |
| Sparse logreg | MNIST | 325 (444) | 212 (387) | 325 (516) | 203 (286) | 149 (218) |
| Sparse logreg | News20 | 316 (4.96e3) | 211 (3.84e3) | 316 (6.36e3) | 207 (3.73e3) | 137 (2.71e3) |
| SVM | MNIST | 1000+ (930) | 172 (287) | 73 (127) | 285 (340) | 41 (88.0) |
| SVM | News20 | 259 (2.63e3) | 262 (2.74e3) | 259 (3.83e3) | 267 (2.78e3) | 217 (2.37e3) |
| SDP | Ham-9-5-6 | 100+ (2.01e3) | 100+ (2.14e3) | 35 (860) | 100+ (2.14e3) | 30 (703) |

Robust to initial penalty selection
• More sensitivity analysis in the paper!

[Figure: ENRegression-Synthetic2 - iterations versus initial penalty parameter ($10^{-2}$ to $10^4$, log-log scale).]

Acceleration by distribution

[Figure: SVM-Synthetic2 - iterations and runtime in seconds versus number of cores ($10^1$ to $10^2$, log-log scale).]

Summary
• Fully automated optimizer for the consensus problem in distributed computing
• Node-specific spectral penalty for ADMM
• $O(1/k)$ convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Thank you! Poster #28 tonight
Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, Tom Goldstein

