Adaptive ADMM with Spectral Penalty Parameter Selection

Zheng Xu¹, Mário A. T. Figueiredo², Tom Goldstein¹

¹Department of Computer Science, University of Maryland, College Park, MD
²Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

April 2017
Outline

- Constrained problems and the alternating direction method of multipliers (ADMM)
- The penalty parameter: crucial for practical convergence
- Spectral penalty parameter selection: fast and fully automated
- Numerical results on various applications and datasets
Constrained problem and ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$
Typical applications

- Sparse linear regression (elastic net regularizer)
$$\min_{x}\; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1 \|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$
- Low rank problem (nuclear norm regularizer)
Typical applications

- Sparse linear regression (elastic net regularizer)
- Low rank problem (nuclear norm regularizer)
- Basis pursuit
- Semidefinite programming
- Dual of SVM / quadratic programming
Typical applications

- Consensus problem for distributed computing (a minimal code sketch follows below)
$$\min_{x_i, z}\; \sum_{i=1}^{N} f_i(x_i) + g(z) \quad \text{s.t.} \quad x_i - z = 0,\; i = 1, \dots, N.$$
- More applications: neural networks, tensor decomposition, phase retrieval, robust PCA, TV image problems [Taylor et al., 2016, Xu et al., 2016a,b, 2017]

[Figure: worker nodes holding f_i(x_i) communicating with a central server that holds z; "ADMM for neural nets" illustration]
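A minimal NumPy sketch of consensus ADMM under the multiplier convention used in these slides (multiplier on $b - Au - Bv$). The helper interfaces `prox_f_list[i]` and `prox_g` are assumptions introduced here for illustration, not the authors' code:

```python
import numpy as np

def consensus_admm(prox_f_list, prox_g, z0, tau=1.0, iters=100):
    """Sketch of consensus ADMM for min sum_i f_i(x_i) + g(z) s.t. x_i = z.
    prox_f_list[i](x, t) is assumed to return argmin_u f_i(u) + (1/(2t))||u - x||^2,
    and prox_g likewise for g. Multiplier convention follows the slides:
    lam_i <- lam_i + tau * (z - x_i)."""
    N = len(prox_f_list)
    z = z0.copy()
    lams = [np.zeros_like(z0) for _ in range(N)]
    for _ in range(iters):
        # local updates (these could run in parallel on worker nodes)
        xs = [prox_f_list[i](z + lams[i] / tau, 1.0 / tau) for i in range(N)]
        # central server aggregates and updates the consensus variable
        x_bar = sum(xs) / N
        lam_bar = sum(lams) / N
        z = prox_g(x_bar - lam_bar / tau, 1.0 / (N * tau))
        # dual updates
        lams = [lams[i] + tau * (z - xs[i]) for i in range(N)]
    return xs, z
```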
Constrained problem and ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Saddle point problem with augmented Lagrangian
$$\max_{\lambda} \min_{u,v}\; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \frac{\tau}{2}\|b - Au - Bv\|^2$$

[Figure: saddle-shaped surface of the augmented Lagrangian]
Alternating direction method of multipliers (ADMM)
$$u_{k+1} = \arg\min_u\; H(u) + \langle \lambda_k, -Au \rangle + \frac{\tau}{2}\|b - Au - Bv_k\|^2$$
$$v_{k+1} = \arg\min_v\; G(v) + \langle \lambda_k, -Bv \rangle + \frac{\tau}{2}\|b - Au_{k+1} - Bv\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau (b - Au_{k+1} - Bv_{k+1})$$
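As a concrete reference, here is a minimal sketch of these three steps for the special case $A = I$, $B = -I$, $b = 0$ (the constraint $u = v$); `prox_h` and `prox_g` are assumed helpers that solve the two subproblems, and the fixed iteration count is a simplification:

```python
import numpy as np

def admm(prox_h, prox_g, v0, tau=1.0, iters=100):
    """Minimal sketch of the three ADMM steps for
    min H(u) + G(v) s.t. u = v   (A = I, B = -I, b = 0).
    prox_h(x, t) is assumed to return argmin_u H(u) + (1/(2t))||u - x||^2,
    and prox_g likewise for G."""
    u = v0.copy()
    v = v0.copy()
    lam = np.zeros_like(v0)
    for _ in range(iters):
        u = prox_h(v + lam / tau, 1.0 / tau)   # u-update (first subproblem)
        v = prox_g(u - lam / tau, 1.0 / tau)   # v-update (second subproblem)
        lam = lam + tau * (v - u)              # dual ascent: b - Au - Bv = v - u
    return u, v, lam
```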
Penalty parameter

How should the free parameter $\tau_k$ be selected in ADMM?
$$u_{k+1} = \arg\min_u\; H(u) + \langle \lambda_k, -Au \rangle + \frac{\tau_k}{2}\|b - Au - Bv_k\|^2$$
$$v_{k+1} = \arg\min_v\; G(v) + \langle \lambda_k, -Bv \rangle + \frac{\tau_k}{2}\|b - Au_{k+1} - Bv\|^2$$
$$\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_{k+1})$$
Background: spectral stepsize for gradient descent

Objective: $\min_x F(x)$

Gradient descent: $x_{k+1} = x_k - \tau_k \nabla F(x_k)$
If $F$ is quadratic, $F(x) = \frac{\alpha}{2}\|x - x^*\|^2$, the optimal stepsize is $\tau_k = 1/\alpha$: a single step lands exactly on the minimizer $x^*$.

[Figure: one gradient step from x_k to x_{k+1} with stepsize 1/α]
Spectral (Barzilai–Borwein) stepsize: $\tau_k = 1/\alpha$, where $\alpha$ is the local curvature under the linear model $\nabla F(x) = \alpha x + a$, and $\alpha$ is estimated by 1-dimensional least squares on
$$\nabla F(x_k) - \nabla F(x_{k-1}) = \alpha (x_k - x_{k-1}).$$
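A minimal sketch of gradient descent with this Barzilai–Borwein estimate; `grad`, the warm-up step `tau0`, and the fixed iteration count are choices of this sketch (no safeguards, and positive curvature is assumed so the stepsize stays positive):

```python
import numpy as np

def bb_gradient_descent(grad, x0, tau0=1.0, iters=100):
    """Gradient descent with the spectral (Barzilai-Borwein) stepsize:
    fit grad(x_k) - grad(x_{k-1}) = alpha * (x_k - x_{k-1}) by 1-D least
    squares and take tau_k = 1/alpha."""
    x_old = x0
    g_old = grad(x0)
    x = x0 - tau0 * g_old                    # one fixed-step iteration to start
    for _ in range(iters):
        g = grad(x)
        dx, dg = x - x_old, g - g_old
        alpha = dg.dot(dx) / dx.dot(dx)      # least-squares curvature estimate
        tau = 1.0 / alpha                    # spectral stepsize (alpha > 0 assumed)
        x_old, g_old = x, g
        x = x - tau * g
    return x
```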
Background: spectral stepsize for gradient descent

- Automates the stepsize selection
- Achieves fast convergence
- Can it be extended to constrained problems?
Dual interpretation of ADMM

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Dual problem without constraints
$$\min_{\lambda}\; \underbrace{H^*(A^T\lambda) - \langle \lambda, b \rangle}_{\hat H(\lambda)} + \underbrace{G^*(B^T\lambda)}_{\hat G(\lambda)}$$

Define $\hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k)$. ADMM is then equivalent to Douglas–Rachford splitting (DRS) applied to the dual: $(u, v, \lambda) \Leftrightarrow (\hat\lambda, \lambda)$.

$F^*$ denotes the Fenchel conjugate of $F$, defined as $F^*(y) = \sup_x \langle x, y \rangle - F(x)$.
Spectral stepsize of DRS

Dual problem: $\min_{\lambda}\; \hat H(\lambda) + \hat G(\lambda)$, with $\hat H(\lambda) = H^*(A^T\lambda) - \langle \lambda, b \rangle$ and $\hat G(\lambda) = G^*(B^T\lambda)$.

Approximate $\partial \hat H$ and $\partial \hat G$ at iteration $k$ as linear functions:
$$\partial \hat H(\hat\lambda) = \alpha \hat\lambda + \Psi \quad \text{and} \quad \partial \hat G(\lambda) = \beta \lambda + \Phi$$

[Figure: local linear models of ∂Ĥ and ∂Ĝ near the current iterate, with inverse slopes α̂ = 1/α and β̂ = 1/β]
[Proposition] When DRS is applied to these linearized subgradients, the minimal residual of $\hat H(\hat\lambda_{k+1}) + \hat G(\lambda_{k+1})$ is obtained by setting
$$\tau_k = 1/\sqrt{\alpha \beta}.$$
Spectral stepsize estimation

- Spectral stepsize: $\tau_k = 1/\sqrt{\alpha_k \beta_k}$.
- Estimate the curvatures $\alpha, \beta$ of $\hat H, \hat G$ from the ADMM iterates $(u, v, \lambda, \hat\lambda)$, using the linear models
$$\hat\lambda_k - \hat\lambda_{k_0} = \hat\alpha \cdot A(u_k - u_{k_0}) \quad \text{and} \quad \lambda_k - \lambda_{k_0} = \hat\beta \cdot B(v_k - v_{k_0}), \qquad \hat\alpha = 1/\alpha,\; \hat\beta = 1/\beta.$$
- 1-dimensional least squares with closed-form solutions (a sketch follows below).

Recall $\hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k)$.
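A sketch of the closed-form 1-D least-squares fits above, returning the spectral penalty $\tau = 1/\sqrt{\alpha\beta} = \sqrt{\hat\alpha\hat\beta}$. The paper's estimator is more careful (it combines two least-squares fits per curvature), so treat this single-fit version as illustrative; the argument names are ours:

```python
import numpy as np

def spectral_penalty(du_A, dlam_hat, dv_B, dlam):
    """Sketch of the spectral penalty estimate from iterate differences.
    du_A = A(u_k - u_{k0}),  dlam_hat = lam_hat_k - lam_hat_{k0}
    dv_B = B(v_k - v_{k0}),  dlam     = lam_k - lam_{k0}
    Fits dlam_hat ~ alpha_hat * du_A and dlam ~ beta_hat * dv_B by 1-D
    least squares (alpha_hat = 1/alpha, beta_hat = 1/beta)."""
    alpha_hat = du_A.dot(dlam_hat) / du_A.dot(du_A)
    beta_hat = dv_B.dot(dlam) / dv_B.dot(dv_B)
    # tau = 1/sqrt(alpha * beta) = sqrt(alpha_hat * beta_hat)
    return np.sqrt(alpha_hat * beta_hat)
```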
Safeguarding inaccurate estimation

Objective: $\min_x F(x)$; gradient descent $x_{k+1} = x_k - \tau_k \nabla F(x_k)$ safeguards an inaccurate spectral stepsize with a backtracking linesearch. What is the analogue for ADMM?
Safeguarding inaccurate estimation

Constrained problem
$$\min_{u,v}\; H(u) + G(v) \quad \text{subject to} \quad Au + Bv = b.$$

Lagrangian saddle point problem
$$\max_{\lambda} \min_{u,v}\; H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle$$

[Figure: saddle-shaped surface of the (unaugmented) Lagrangian]
Safeguarding

Validate the correlations behind the linear assumption on the (sub)gradients:
$$\alpha_k^{\mathrm{cor}} = \frac{\langle A(u_k - u_{k_0}),\; \hat\lambda_k - \hat\lambda_{k_0} \rangle}{\|A(u_k - u_{k_0})\| \, \|\hat\lambda_k - \hat\lambda_{k_0}\|}, \qquad \beta_k^{\mathrm{cor}} = \frac{\langle B(v_k - v_{k_0}),\; \lambda_k - \lambda_{k_0} \rangle}{\|B(v_k - v_{k_0})\| \, \|\lambda_k - \lambda_{k_0}\|}$$
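The same quantities in NumPy terms: a plain cosine similarity between the iterate differences, matching the formulas above (variable names are ours, as in the earlier sketch):

```python
import numpy as np

def correlations(du_A, dlam_hat, dv_B, dlam):
    """Cosine-similarity tests of the linear (sub)gradient assumption,
    i.e. the alpha_cor / beta_cor quantities above."""
    alpha_cor = du_A.dot(dlam_hat) / (np.linalg.norm(du_A) * np.linalg.norm(dlam_hat))
    beta_cor = dv_B.dot(dlam) / (np.linalg.norm(dv_B) * np.linalg.norm(dlam))
    return alpha_cor, beta_cor
```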
Safeguarding

Safeguarded spectral penalty parameter:
$$\tau_{k+1} = \begin{cases} 1/\sqrt{\alpha_k \beta_k} & \text{if } \alpha_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \\ 1/\alpha_k & \text{if } \alpha_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} \le \epsilon^{\mathrm{cor}} \\ 1/\beta_k & \text{if } \alpha_k^{\mathrm{cor}} \le \epsilon^{\mathrm{cor}} \text{ and } \beta_k^{\mathrm{cor}} > \epsilon^{\mathrm{cor}} \\ \tau_k & \text{otherwise.} \end{cases}$$
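A direct transcription of this case analysis, written with the inverse curvatures $\hat\alpha = 1/\alpha$ and $\hat\beta = 1/\beta$ produced by the estimation sketch earlier (interfaces are ours):

```python
import numpy as np

def safeguarded_penalty(alpha_hat, beta_hat, alpha_cor, beta_cor,
                        tau_old, eps_cor=0.2):
    """Safeguarded spectral rule: only trust a curvature estimate when the
    corresponding correlation exceeds eps_cor."""
    if alpha_cor > eps_cor and beta_cor > eps_cor:
        return np.sqrt(alpha_hat * beta_hat)   # 1/sqrt(alpha * beta)
    if alpha_cor > eps_cor:
        return alpha_hat                       # 1/alpha
    if beta_cor > eps_cor:
        return beta_hat                        # 1/beta
    return tau_old                             # keep tau_k
```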
Convergence guarantee

- Adaptive ADMM converges when one of the following conditions is satisfied [He et al., 2000, Xu et al., 2017] (a sketch of one way to enforce them follows below):
- Bounded increasing:
$$\sum_{k=1}^{\infty} \eta_k^2 < \infty, \quad \text{where } \eta_k = \max\left\{\sqrt{\tfrac{\tau_k}{\tau_{k-1}}},\, 1\right\} - 1$$
- Bounded decreasing:
$$\sum_{k=1}^{\infty} \theta_k^2 < \infty, \quad \text{where } \theta_k = \max\left\{\sqrt{\tfrac{\tau_{k-1}}{\tau_k}},\, 1\right\} - 1$$
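One simple way to enforce both conditions — a choice of ours, not taken from the slides — is to cap the relative change of $\tau$ at $1 + C/k^2$, which makes $\eta_k$ and $\theta_k$ of order $1/k^2$ and hence square-summable:

```python
def clip_penalty(tau_new, tau_old, k, C=100.0):
    """Force tau_k / tau_{k-1} into [1/(1 + C/k^2), 1 + C/k^2], so that
    eta_k, theta_k <= sqrt(1 + C/k^2) - 1 = O(1/k^2) and both series
    converge (an illustrative enforcement, not the paper's scheme)."""
    bound = 1.0 + C / float(k * k)
    return min(max(tau_new, tau_old / bound), tau_old * bound)
```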
Adaptive ADMM algorithm

- ADMM steps to update $(u_{k+1}, v_{k+1}, \lambda_{k+1})$
- Estimate the curvatures $\alpha_k, \beta_k$
- Estimate the correlations $\alpha_k^{\mathrm{cor}}, \beta_k^{\mathrm{cor}}$
- Apply the safeguarded spectral penalty rule to update $\tau_{k+1}$
- Stop adaptivity after a fixed number of iterations to guarantee convergence (see the skeleton below)
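Putting the pieces together, a hypothetical skeleton of the loop; `admm_step` and `spectral_update` stand in for the routines sketched on the previous slides, and the cadence `adapt_every = 2` and cutoff `adapt_until` are assumptions of this sketch, not necessarily the paper's schedule:

```python
def adaptive_admm(admm_step, spectral_update, tau0=0.1, iters=2000,
                  adapt_every=2, adapt_until=1000):
    """Skeleton of adaptive ADMM: run ADMM steps and periodically replace
    tau by the safeguarded spectral estimate; adaptivity is switched off
    after adapt_until iterations to guarantee convergence."""
    tau, state = tau0, None
    for k in range(1, iters + 1):
        state = admm_step(state, tau)           # update (u, v, lambda)
        if k % adapt_every == 0 and k <= adapt_until:
            tau = spectral_update(state, tau)   # safeguarded spectral rule
    return state
```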
Experiments

Baselines:
- Vanilla ADMM [Gabay and Mercier, 1976, Glowinski and Marroco, 1975, Boyd et al., 2011]
- Nesterov acceleration ("Fast ADMM") [Goldstein et al., 2014]
- Fixed optimal penalty parameter [Raghunathan and Di Cairano, 2014]
- Residual balancing [He et al., 2000, Boyd et al., 2011], sketched below, with residuals
$$r_k = b - Au_k - Bv_k, \qquad d_k = \tau_k A^T B (v_k - v_{k-1})$$
$$\tau_{k+1} = \begin{cases} \eta \tau_k & \text{if } \|r_k\|_2 > \mu \|d_k\|_2 \\ \tau_k / \eta & \text{if } \|d_k\|_2 > \mu \|r_k\|_2 \\ \tau_k & \text{otherwise} \end{cases} \qquad (\eta = 10,\; \mu = 2)$$
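The residual balancing baseline as code, with the slide's stated constants as defaults:

```python
def residual_balance(r_norm, d_norm, tau, eta=10.0, mu=2.0):
    """Residual balancing (He et al., 2000; Boyd et al., 2011): grow tau
    when the primal residual dominates, shrink it when the dual residual
    dominates; default eta and mu are the values stated on the slide."""
    if r_norm > mu * d_norm:
        return eta * tau
    if d_norm > mu * r_norm:
        return tau / eta
    return tau
```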
Numerical results

- Applications: elastic net regularized linear regression (EN); low rank least squares (LRLS); dual SVM (quadratic programming); basis pursuit (BP); consensus logistic regression; semidefinite programming (SDP)
- Benchmark datasets from the UCI repository and the LIBSVM page
- Initial penalty $\tau_0 = 0.1$, fixed safeguarding threshold $\epsilon^{\mathrm{cor}} = 0.2$
- Details and more results in the paper!

Iterations to convergence (time in parentheses); "1000+" / "2000+" means the iteration cap was reached. Rows follow the application order above (some applications appear with more than one dataset):

| Dataset  | Vanilla ADMM | Fast ADMM | Residual balance | Adaptive ADMM |
|----------|--------------|-----------|------------------|---------------|
| Boston   | 2000+        | 208       | 54 (.023)        | 17 (.011)     |
| Leukemia | 2000+        | 2000+     | 1737 (19.3)      | 152 (1.70)    |
| Madelon  | 1943         | 193       | 133 (60.9)       | 27 (12.8)     |
| Madelon  | 100          | 57        | 28 (4.12)        | 19 (2.64)     |
| Human1   | 2000+        | 2000+     | 839 (.990)       | 503 (.626)    |
| Madelon  | 2000+        | 2000+     | 115 (42.1)       | 23 (20.8)     |
| Realsim  | 1000+        | 1000+     | 121 (558)        | 22 (118)      |
| Ham-11-2 | 2000+        | 2000+     | 1203 (4.15e3)    | 447 (1.49e3)  |
Residual plot

- Relative residual:
$$\max\left\{ \frac{\|r_k\|_2}{\max\{\|Au_k\|_2, \|Bv_k\|_2, \|b\|_2\}},\; \frac{\|d_k\|_2}{\|A^T \lambda_k\|_2} \right\}$$
- Low rank least squares:
$$\min_X\; \tfrac{1}{2}\|DX - C\|_F^2 + \rho_1 \|X\|_* + \tfrac{\rho_2}{2}\|X\|_F^2$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: relative residual (log scale, 10¹ down to 10⁻⁵) versus iteration (0–600) for the four methods]
Sensitivity: initial penalty

- Elastic net regularized linear regression (an ADMM sketch for this splitting follows below)
$$\min_{u,v}\; \tfrac{1}{2}\|Du - c\|_2^2 + \rho_1 \|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2 \quad \text{s.t.} \quad u - v = 0$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus initial penalty parameter τ₀ ∈ [10⁻⁵, 10⁵] for (a) elastic net regression and (b) quadratic programming]
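A runnable sketch of this exact splitting with a fixed penalty; the adaptive method would additionally update `tau` with the safeguarded spectral rule sketched earlier. The dense solve and the iteration count are choices of this sketch:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def elastic_net_admm(D, c, rho1, rho2, tau=0.1, iters=500):
    """ADMM for min_{u,v} 0.5||Du - c||^2 + rho1||v||_1 + (rho2/2)||v||^2
    s.t. u - v = 0, with the multiplier convention of the slides."""
    n = D.shape[1]
    u = np.zeros(n); v = np.zeros(n); lam = np.zeros(n)
    Dtc = D.T @ c
    M = D.T @ D + tau * np.eye(n)   # re-factor M if tau changes across iterations
    for _ in range(iters):
        u = np.linalg.solve(M, Dtc + lam + tau * v)          # u-update (least squares)
        v = soft_threshold((tau * u - lam) / (rho2 + tau),
                           rho1 / (rho2 + tau))              # v-update (shrinkage)
        lam = lam + tau * (v - u)                            # dual update, u - v = 0
    return v
```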
Sensitivity: problem scale

- Quadratic programming with scale $s$
$$\min_{u,v}\; \tfrac{s}{2}\, u^T Q u + s\, q^T u + \iota_{\{z:\, z_i \le c\}}(v) \quad \text{s.t.} \quad Du - v = 0$$
- Compared methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus problem scale s ∈ [10⁻⁵, 10⁵]; panels include (b) quadratic programming and (c) a low rank regression problem]
Sensitivity: safeguarding threshold

- $\epsilon^{\mathrm{cor}} = 0.2$ works well across applications

[Figure: iterations to convergence versus the safeguarding correlation threshold ∈ [0, 1] for EN LinReg, Cons LogReg, Quad Prog, Basis Pursuit, LRLS, and SDP]
Conclusion and extensions

Spectral penalty parameter selection for constrained problems:
- ADMM is equivalent to DRS on the unconstrained dual problem
- Combine the estimated curvatures of the two dual terms
- Effective safeguarding
- Fully automated, with fast convergence

Extensions:
- Relaxed ADMM [Xu et al., 2017]
- Nonconvex applications [Xu et al., 2016a]
- Multi-block ADMM [Xu et al., under review]
- Large-scale distributed computing [Xu et al., under review]
- O(1/k) convergence rate [Xu et al., under review]
Q&A

Thank you!
References I

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Modélisation Mathématique et Analyse Numérique, 9:41–76, 1975.

T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.

B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337–356, 2000.
References II

A. Raghunathan and S. Di Cairano. Alternating direction method of multipliers for strictly convex quadratic programs: Optimal parameter selection. In American Control Conference, pages 4324–4329, 2014.

G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable ADMM approach. ICML, 2016.

Z. Xu, S. De, M. A. T. Figueiredo, C. Studer, and T. Goldstein. An empirical study of ADMM for nonconvex problems. In NIPS Workshop on Nonconvex Optimization, 2016a.

Z. Xu, F. Huang, L. Raschid, and T. Goldstein. Non-negative factorization of the occurrence tensor from financial contracts. In NIPS Workshop on Tensor Methods, 2016b.

Z. Xu, M. A. T. Figueiredo, X. Yuan, C. Studer, and T. Goldstein. Adaptive relaxed ADMM: Convergence theory and practical implementation. CVPR, 2017.