Adaptive ADMM with Spectral Penalty Parameter Selection

Zheng Xu¹, Mário A. T. Figueiredo², Tom Goldstein¹

¹Department of Computer Science, University of Maryland, College Park, MD
²Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

April 2017
Outline

- Constrained problems and the alternating direction method of multipliers (ADMM)
- The penalty parameter: crucial for practical convergence
- Spectral penalty parameter selection: fast and fully automated
- Numerical results on various applications and datasets
Constrained problem and ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$
Typical applications

- Sparse linear regression (elastic net regularizer)
$$\min_{x}\; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1 \|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$
- Low rank problem (nuclear norm regularizer)
Typical applications

- Sparse linear regression (elastic net regularizer)
- Low rank problem (nuclear norm regularizer)
- Basis pursuit
- Semidefinite programming
- Dual of SVM / quadratic programming
Typical applications

- Consensus problem for distributed computing (a minimal code sketch follows below)
$$\min_{x_i, z}\; \sum_{i=1}^{N} f_i(x_i) + g(z) \quad \text{s.t.} \quad x_i - z = 0,\; i = 1, \dots, N.$$
- More applications: neural networks, tensor decomposition, phase retrieval, robust PCA, TV image problems [Taylor et al., 2016, Xu et al., 2016a,b, 2017]

[Figure: worker nodes holding f_i(x_i) communicating with a central server that holds z; "ADMM for neural nets" illustration]
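A minimal NumPy sketch of consensus ADMM under the multiplier convention used in these slides (multiplier on $b - Au - Bv$). The helper interfaces `prox_f_list[i]` and `prox_g` are assumptions introduced here for illustration, not the authors' code:

```python
import numpy as np

def consensus_admm(prox_f_list, prox_g, z0, tau=1.0, iters=100):
    """Sketch of consensus ADMM for min sum_i f_i(x_i) + g(z) s.t. x_i = z.
    prox_f_list[i](x, t) is assumed to return argmin_u f_i(u) + (1/(2t))||u - x||^2,
    and prox_g likewise for g. Multiplier convention follows the slides:
    lam_i <- lam_i + tau * (z - x_i)."""
    N = len(prox_f_list)
    z = z0.copy()
    lams = [np.zeros_like(z0) for _ in range(N)]
    for _ in range(iters):
        # local updates (these could run in parallel on worker nodes)
        xs = [prox_f_list[i](z + lams[i] / tau, 1.0 / tau) for i in range(N)]
        # central server aggregates and updates the consensus variable
        x_bar = sum(xs) / N
        lam_bar = sum(lams) / N
        z = prox_g(x_bar - lam_bar / tau, 1.0 / (N * tau))
        # dual updates
        lams = [lams[i] + tau * (z - xs[i]) for i in range(N)]
    return xs, z
```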
Constrained problem and ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Saddle point problem with augmented Lagrangian
$$\max_{\lambda} \min_{u,v}\; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

[Figure: saddle-shaped surface of the augmented Lagrangian]
Alternating direction method of multipliers (ADMM)
$$u_{k+1} = \arg\min_u\; H(u) + \langle \lambda_k, -Au \rangle + \frac{\tau}{2}\|b - Au - Bv_k\|^2$$
$$v_{k+1} = \arg\min_v\; G(v) + \langle \lambda_k, -Bv \rangle + \frac{\tau}{2}\|b - Au_{k+1} - Bv\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau (b - Au_{k+1} - Bv_{k+1})$$
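As a concrete reference, here is a minimal sketch of these three steps for the special case $A = I$, $B = -I$, $b = 0$ (the constraint $u = v$); `prox_h` and `prox_g` are assumed helpers that solve the two subproblems, and the fixed iteration count is a simplification:

```python
import numpy as np

def admm(prox_h, prox_g, v0, tau=1.0, iters=100):
    """Minimal sketch of the three ADMM steps for
    min H(u) + G(v) s.t. u = v   (A = I, B = -I, b = 0).
    prox_h(x, t) is assumed to return argmin_u H(u) + (1/(2t))||u - x||^2,
    and prox_g likewise for G."""
    u = v0.copy()
    v = v0.copy()
    lam = np.zeros_like(v0)
    for _ in range(iters):
        u = prox_h(v + lam / tau, 1.0 / tau)   # u-update (first subproblem)
        v = prox_g(u - lam / tau, 1.0 / tau)   # v-update (second subproblem)
        lam = lam + tau * (v - u)              # dual ascent: b - Au - Bv = v - u
    return u, v, lam
```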
Penalty parameter

How should the free parameter $\tau_k$ be selected in ADMM?
$$u_{k+1} = \arg\min_u\; H(u) + \langle \lambda_k, -Au \rangle + \frac{\tau_k}{2}\|b - Au - Bv_k\|^2$$
$$v_{k+1} = \arg\min_v\; G(v) + \langle \lambda_k, -Bv \rangle + \frac{\tau_k}{2}\|b - Au_{k+1} - Bv\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_{k+1})$$
Background: spectral stepsize for gradient descent

Objective: $\min_x F(x)$

Gradient descent: $x_{k+1} = x_k - \tau_k \nabla F(x_k)$
If $F$ is quadratic, $F(x) = \frac{\alpha}{2}\|x - x^*\|^2$, the optimal stepsize is $\tau_k = 1/\alpha$: a single step lands exactly on the minimizer $x^*$.

[Figure: one gradient step from x_k to x_{k+1} with stepsize 1/α]
Spectral (Barzilai–Borwein) stepsize: $\tau_k = 1/\alpha$, where $\alpha$ is the local curvature under the linear model $\nabla F(x) = \alpha x + a$, and $\alpha$ is estimated by 1-dimensional least squares on
$$\nabla F(x_k) - \nabla F(x_{k-1}) = \alpha (x_k - x_{k-1}).$$
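A minimal sketch of gradient descent with this Barzilai–Borwein estimate; `grad`, the warm-up step `tau0`, and the fixed iteration count are choices of this sketch (no safeguards, and positive curvature is assumed so the stepsize stays positive):

```python
import numpy as np

def bb_gradient_descent(grad, x0, tau0=1.0, iters=100):
    """Gradient descent with the spectral (Barzilai-Borwein) stepsize:
    fit grad(x_k) - grad(x_{k-1}) = alpha * (x_k - x_{k-1}) by 1-D least
    squares and take tau_k = 1/alpha."""
    x_old = x0
    g_old = grad(x0)
    x = x0 - tau0 * g_old                    # one fixed-step iteration to start
    for _ in range(iters):
        g = grad(x)
        dx, dg = x - x_old, g - g_old
        alpha = dg.dot(dx) / dx.dot(dx)      # least-squares curvature estimate
        tau = 1.0 / alpha                    # spectral stepsize (alpha > 0 assumed)
        x_old, g_old = x, g
        x = x - tau * g
    return x
```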
Background: spectral stepsize for gradient descent

- Automates the stepsize selection
- Achieves fast convergence
- Can it be extended to constrained problems?
Dual interpretation of ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Dual problem without constraints
$$\min_{\lambda}\; \underbrace{H^*(A^T\lambda) - \langle \lambda, b \rangle}_{\hat H(\lambda)} + \underbrace{G^*(B^T\lambda)}_{\hat G(\lambda)}$$

Define $\hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k)$. ADMM is then equivalent to Douglas–Rachford splitting (DRS) applied to the dual: $(u, v, \lambda) \Leftrightarrow (\hat\lambda, \lambda)$.

$F^*$ denotes the Fenchel conjugate of $F$, defined as $F^*(y) = \sup_x \langle x, y \rangle - F(x)$.
Spectral stepsize of DRS

Dual problem: $\min_{\lambda}\; \hat H(\lambda) + \hat G(\lambda)$, with $\hat H(\lambda) = H^*(A^T\lambda) - \langle \lambda, b \rangle$ and $\hat G(\lambda) = G^*(B^T\lambda)$.

Approximate $\partial \hat H$ and $\partial \hat G$ at iteration $k$ as linear functions:
$$\partial \hat H(\hat\lambda) = \alpha \hat\lambda + \Psi \quad \text{and} \quad \partial \hat G(\lambda) = \beta \lambda + \Phi$$

[Figure: local linear models of ∂Ĥ and ∂Ĝ near the current iterate, with inverse slopes α̂ = 1/α and β̂ = 1/β]
[Proposition] When DRS is applied to these linearized subgradients, the minimal residual of $\hat H(\hat\lambda_{k+1}) + \hat G(\lambda_{k+1})$ is obtained by setting
$$\tau_k = 1/\sqrt{\alpha \beta}.$$
Spectral stepsize estimation

- Spectral stepsize: $\tau_k = 1/\sqrt{\alpha_k \beta_k}$.
- Estimate the curvatures $\alpha, \beta$ of $\hat H, \hat G$ from the ADMM iterates $(u, v, \lambda, \hat\lambda)$, using the linear models
$$\hat\lambda_k - \hat\lambda_{k_0} = \hat\alpha \cdot A(u_k - u_{k_0}) \quad \text{and} \quad \lambda_k - \lambda_{k_0} = \hat\beta \cdot B(v_k - v_{k_0}), \qquad \hat\alpha = 1/\alpha,\; \hat\beta = 1/\beta.$$
- 1-dimensional least squares with closed-form solutions (a sketch follows below).

Recall $\hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k)$.
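A sketch of the closed-form 1-D least-squares fits above, returning the spectral penalty $\tau = 1/\sqrt{\alpha\beta} = \sqrt{\hat\alpha\hat\beta}$. The paper's estimator is more careful (it combines two least-squares fits per curvature), so treat this single-fit version as illustrative; the argument names are ours:

```python
import numpy as np

def spectral_penalty(du_A, dlam_hat, dv_B, dlam):
    """Sketch of the spectral penalty estimate from iterate differences.
    du_A = A(u_k - u_{k0}),  dlam_hat = lam_hat_k - lam_hat_{k0}
    dv_B = B(v_k - v_{k0}),  dlam     = lam_k - lam_{k0}
    Fits dlam_hat ~ alpha_hat * du_A and dlam ~ beta_hat * dv_B by 1-D
    least squares (alpha_hat = 1/alpha, beta_hat = 1/beta)."""
    alpha_hat = du_A.dot(dlam_hat) / du_A.dot(du_A)
    beta_hat = dv_B.dot(dlam) / dv_B.dot(dv_B)
    # tau = 1/sqrt(alpha * beta) = sqrt(alpha_hat * beta_hat)
    return np.sqrt(alpha_hat * beta_hat)
```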
Safeguarding inaccurate estimation

Objective: $\min_x F(x)$; gradient descent $x_{k+1} = x_k - \tau_k \nabla F(x_k)$ safeguards an inaccurate spectral stepsize with a backtracking linesearch. What is the analogue for ADMM?
Safeguarding inaccurate estimation

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Lagrangian saddle point problem
$$\max_{\lambda} \min_{u,v}\; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$$

[Figure: saddle-shaped surface of the (unaugmented) Lagrangian]
Safeguarding

Validate the correlations behind the linear assumption on the (sub)gradients:
$$\alpha_k^{\mathrm{cor}} = \frac{\langle A(u_k - u_{k_0}),\; \hat\lambda_k - \hat\lambda_{k_0} \rangle}{\|A(u_k - u_{k_0})\| \, \|\hat\lambda_k - \hat\lambda_{k_0}\|}, \qquad \beta_k^{\mathrm{cor}} = \frac{\langle B(v_k - v_{k_0}),\; \lambda_k - \lambda_{k_0} \rangle}{\|B(v_k - v_{k_0})\| \, \|\lambda_k - \lambda_{k_0}\|}$$
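The same quantities in NumPy terms: a plain cosine similarity between the iterate differences, matching the formulas above (variable names are ours, as in the earlier sketch):

```python
import numpy as np

def correlations(du_A, dlam_hat, dv_B, dlam):
    """Cosine-similarity tests of the linear (sub)gradient assumption,
    i.e. the alpha_cor / beta_cor quantities above."""
    alpha_cor = du_A.dot(dlam_hat) / (np.linalg.norm(du_A) * np.linalg.norm(dlam_hat))
    beta_cor = dv_B.dot(dlam) / (np.linalg.norm(dv_B) * np.linalg.norm(dlam))
    return alpha_cor, beta_cor
```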
Safeguarding

Safeguarded spectral penalty parameter:
$$\tau_{k+1} = \begin{cases} 1/\sqrt{\alpha_k \beta_k} & \text{if } \alpha_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \\ 1/\alpha_k & \text{if } \alpha_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} \le \epsilon^{\mathrm{cor}} \\ 1/\beta_k & \text{if } \alpha_k^{\mathrm{cor}} \le \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \\ \tau_k & \text{otherwise.} \end{cases}$$
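A direct transcription of this case analysis, written with the inverse curvatures $\hat\alpha = 1/\alpha$ and $\hat\beta = 1/\beta$ produced by the estimation sketch earlier (interfaces are ours):

```python
import numpy as np

def safeguarded_penalty(alpha_hat, beta_hat, alpha_cor, beta_cor,
                        tau_old, eps_cor=0.2):
    """Safeguarded spectral rule: only trust a curvature estimate when the
    corresponding correlation exceeds eps_cor."""
    if alpha_cor > eps_cor and beta_cor > eps_cor:
        return np.sqrt(alpha_hat * beta_hat)   # 1/sqrt(alpha * beta)
    if alpha_cor > eps_cor:
        return alpha_hat                       # 1/alpha
    if beta_cor > eps_cor:
        return beta_hat                        # 1/beta
    return tau_old                             # keep tau_k
```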
Convergence guarantee

- Adaptive ADMM converges when one of the following conditions is satisfied [He et al., 2000, Xu et al., 2017] (a sketch of one way to enforce them follows below):
- Bounded increasing:
$$\sum_{k=1}^{\infty} \eta_k^2 < \infty, \quad \text{where } \eta_k = \max\left\{\sqrt{\tfrac{\tau_k}{\tau_{k-1}}},\, 1\right\} - 1$$
- Bounded decreasing:
$$\sum_{k=1}^{\infty} \theta_k^2 < \infty, \quad \text{where } \theta_k = \max\left\{\sqrt{\tfrac{\tau_{k-1}}{\tau_k}},\, 1\right\} - 1$$
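One simple way to enforce both conditions — a choice of ours, not taken from the slides — is to cap the relative change of $\tau$ at $1 + C/k^2$, which makes $\eta_k$ and $\theta_k$ of order $1/k^2$ and hence square-summable:

```python
def clip_penalty(tau_new, tau_old, k, C=100.0):
    """Force tau_k / tau_{k-1} into [1/(1 + C/k^2), 1 + C/k^2], so that
    eta_k, theta_k <= sqrt(1 + C/k^2) - 1 = O(1/k^2) and both series
    converge (an illustrative enforcement, not the paper's scheme)."""
    bound = 1.0 + C / float(k * k)
    return min(max(tau_new, tau_old / bound), tau_old * bound)
```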
Adaptive ADMM algorithm

- ADMM steps to update $(u_{k+1}, v_{k+1}, \lambda_{k+1})$
- Estimate the curvatures $\alpha_k, \beta_k$
- Estimate the correlations $\alpha_k^{\mathrm{cor}}, \beta_k^{\mathrm{cor}}$
- Apply the safeguarded spectral penalty rule to update $\tau_{k+1}$
- Stop adaptivity after a fixed number of iterations to guarantee convergence (see the skeleton below)
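Putting the pieces together, a hypothetical skeleton of the loop; `admm_step` and `spectral_update` stand in for the routines sketched on the previous slides, and the cadence `adapt_every = 2` and cutoff `adapt_until` are assumptions of this sketch, not necessarily the paper's schedule:

```python
def adaptive_admm(admm_step, spectral_update, tau0=0.1, iters=2000,
                  adapt_every=2, adapt_until=1000):
    """Skeleton of adaptive ADMM: run ADMM steps and periodically replace
    tau by the safeguarded spectral estimate; adaptivity is switched off
    after adapt_until iterations to guarantee convergence."""
    tau, state = tau0, None
    for k in range(1, iters + 1):
        state = admm_step(state, tau)           # update (u, v, lambda)
        if k % adapt_every == 0 and k <= adapt_until:
            tau = spectral_update(state, tau)   # safeguarded spectral rule
    return state
```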
Experiments

Baselines:
- Vanilla ADMM [Gabay and Mercier, 1976, Glowinski and Marroco, 1975, Boyd et al., 2011]
- Nesterov acceleration ("Fast ADMM") [Goldstein et al., 2014]
- Fixed optimal penalty parameter [Raghunathan and Di Cairano, 2014]
- Residual balancing [He et al., 2000, Boyd et al., 2011], sketched below, with residuals
$$r_k = b - Au_k - Bv_k, \qquad d_k = \tau_k A^T B (v_k - v_{k-1})$$
$$\tau_{k+1} = \begin{cases} \eta \tau_k & \text{if } \|r_k\|_2 > \mu \|d_k\|_2 \\ \tau_k / \eta & \text{if } \|d_k\|_2 > \mu \|r_k\|_2 \\ \tau_k & \text{otherwise} \end{cases} \qquad (\eta = 10,\; \mu = 2)$$
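The residual balancing baseline as code, with the slide's stated constants as defaults:

```python
def residual_balance(r_norm, d_norm, tau, eta=10.0, mu=2.0):
    """Residual balancing (He et al., 2000; Boyd et al., 2011): grow tau
    when the primal residual dominates, shrink it when the dual residual
    dominates; default eta and mu are the values stated on the slide."""
    if r_norm > mu * d_norm:
        return eta * tau
    if d_norm > mu * r_norm:
        return tau / eta
    return tau
```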
Numerical results

- Applications: elastic net regularized linear regression (EN); low rank least squares (LRLS); dual SVM (quadratic programming); basis pursuit (BP); consensus logistic regression; semidefinite programming (SDP)
- Benchmark datasets from the UCI repository and the LIBSVM page
- Initial penalty $\tau_0 = 0.1$, fixed safeguarding threshold $\epsilon^{\mathrm{cor}} = 0.2$
- Details and more results in the paper!

Iterations to convergence (time in parentheses); "1000+" / "2000+" means the iteration cap was reached. Rows follow the application order above (some applications appear with more than one dataset):

| Dataset  | Vanilla ADMM | Fast ADMM | Residual balance | Adaptive ADMM |
|----------|--------------|-----------|------------------|---------------|
| Boston   | 2000+        | 208       | 54 (.023)        | 17 (.011)     |
| Leukemia | 2000+        | 2000+     | 1737 (19.3)      | 152 (1.70)    |
| Madelon  | 1943         | 193       | 133 (60.9)       | 27 (12.8)     |
| Madelon  | 100          | 57        | 28 (4.12)        | 19 (2.64)     |
| Human1   | 2000+        | 2000+     | 839 (.990)       | 503 (.626)    |
| Madelon  | 2000+        | 2000+     | 115 (42.1)       | 23 (20.8)     |
| Realsim  | 1000+        | 1000+     | 121 (558)        | 22 (118)      |
| Ham-11-2 | 2000+        | 2000+     | 1203 (4.15e3)    | 447 (1.49e3)  |
Residual plot

- Relative residual:
$$\max\left\{ \frac{\|r_k\|_2}{\max\{\|Au_k\|_2, \|Bv_k\|_2, \|b\|_2\}},\; \frac{\|d_k\|_2}{\|A^T \lambda_k\|_2} \right\}$$
- Low rank least squares:
$$\min_X\; \tfrac{1}{2}\|DX - C\|_F^2 + \rho_1 \|X\|_* + \tfrac{\rho_2}{2}\|X\|_F^2$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: relative residual (log scale, 10¹ down to 10⁻⁵) versus iteration (0–600) for the four methods]
Sensitivity: initial penalty

- Elastic net regularized linear regression (an ADMM sketch for this splitting follows below)
$$\min_{u,v}\; \tfrac{1}{2}\|Du - c\|_2^2 + \rho_1 \|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2 \quad \text{s.t.} \quad u - v = 0$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus initial penalty parameter τ₀ ∈ [10⁻⁵, 10⁵] for (a) elastic net regression and (b) quadratic programming]
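A runnable sketch of this exact splitting with a fixed penalty; the adaptive method would additionally update `tau` with the safeguarded spectral rule sketched earlier. The dense solve and the iteration count are choices of this sketch:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def elastic_net_admm(D, c, rho1, rho2, tau=0.1, iters=500):
    """ADMM for min_{u,v} 0.5||Du - c||^2 + rho1||v||_1 + (rho2/2)||v||^2
    s.t. u - v = 0, with the multiplier convention of the slides."""
    n = D.shape[1]
    u = np.zeros(n); v = np.zeros(n); lam = np.zeros(n)
    Dtc = D.T @ c
    M = D.T @ D + tau * np.eye(n)   # re-factor M if tau changes across iterations
    for _ in range(iters):
        u = np.linalg.solve(M, Dtc + lam + tau * v)          # u-update (least squares)
        v = soft_threshold((tau * u - lam) / (rho2 + tau),
                           rho1 / (rho2 + tau))              # v-update (shrinkage)
        lam = lam + tau * (v - u)                            # dual update, u - v = 0
    return v
```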
Sensitivity: problem scale

- Quadratic programming with scale $s$
$$\min_{u,v}\; \tfrac{s}{2}\, u^T Q u + s\, q^T u + \iota_{\{z:\, z_i \le c\}}(v) \quad \text{s.t.} \quad Du - v = 0$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus problem scale s ∈ [10⁻⁵, 10⁵]; panels include (b) quadratic programming and (c) a low rank regression problem]
Sensitivity: safeguarding threshold

- $\epsilon^{\mathrm{cor}} = 0.2$ works well across applications

[Figure: iterations to convergence versus the safeguarding correlation threshold ∈ [0, 1] for EN LinReg, Cons LogReg, Quad Prog, Basis Pursuit, LRLS, and SDP]
Conclusion and extensions

Spectral penalty parameter selection for constrained problems:
- ADMM is equivalent to DRS on the unconstrained dual problem
- Combine the estimated curvatures of the two dual terms
- Effective safeguarding
- Fully automated, with fast convergence

Extensions:
- Relaxed ADMM [Xu et al., 2017]
- Nonconvex applications [Xu et al., 2016a]
- Multi-block ADMM [Xu et al., under review]
- Large-scale distributed computing [Xu et al., under review]
- O(1/k) convergence rate [Xu et al., under review]
Q&A

Thank you!
References I

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Modélisation Mathématique et Analyse Numérique, 9:41–76, 1975.

T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.

B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.
References II

A. Raghunathan and S. Di Cairano. Alternating direction method of multipliers for strictly convex quadratic programs: Optimal parameter selection. In American Control Conference, pages 4324–4329, 2014.

G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable ADMM approach. ICML, 2016.

Z. Xu, S. De, M. A. T. Figueiredo, C. Studer, and T. Goldstein. An empirical study of ADMM for nonconvex problems. In NIPS Workshop on Nonconvex Optimization, 2016a.

Z. Xu, F. Huang, L. Raschid, and T. Goldstein. Non-negative factorization of the occurrence tensor from financial contracts. In NIPS Workshop on Tensor Methods, 2016b.

Z. Xu, M. A. T. Figueiredo, X. Yuan, C. Studer, and T. Goldstein. Adaptive relaxed ADMM: Convergence theory and practical implementation. CVPR, 2017.