Adaptive ADMM with Spectral Penalty Parameter Selection

Zheng Xu¹, Mário A. T. Figueiredo², Tom Goldstein¹

¹ Department of Computer Science, University of Maryland, College Park, MD
² Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, Portugal

April 2017

Outline

- Constrained problems and the alternating direction method of multipliers (ADMM)
- Penalty parameter: crucial for practical convergence
- Spectral penalty parameter selection: fast and fully automated
- Numerical results on various applications and datasets

Constrained problem and ADMM

Constrained problem:

    \min_{u,v} H(u) + G(v) \quad \text{s.t.} \quad Au + Bv = b

Typical applications

- Sparse linear regression (elastic net regularizer):

      \min_x \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1 \|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2

- Low-rank problem (nuclear norm regularizer)

  [Figure: matrix heatmaps illustrating the two regularized models]

Typical applications

- Sparse linear regression (elastic net regularizer)
- Low-rank problem (nuclear norm regularizer)
- Basis pursuit
- Semidefinite programming
- Dual of SVM / quadratic programming

Typical applications

- Consensus problem for distributed computing:

      \min_{x_i, z} \sum_{i=1}^N f_i(x_i) + g(z) \quad \text{s.t.} \quad x_i - z = 0, \; i = 1, \dots, N

  [Figure: worker nodes holding f_i(x_i) connected to a central server holding z]

- More applications: neural networks, tensor decomposition, phase retrieval, robust PCA, TV image problems [Taylor et al., 2016, Xu et al., 2016a,b, 2017]

Constrained problem and ADMM

Constrained problem:

    \min_{u,v} H(u) + G(v) \quad \text{s.t.} \quad Au + Bv = b

Saddle-point problem with the augmented Lagrangian:

    \max_\lambda \min_{u,v} H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle + \tfrac{\tau}{2} \|b - Au - Bv\|^2

[Figure: saddle-point surface of the Lagrangian]

Alternating direction method of multipliers (ADMM):

    u_{k+1} = \arg\min_u H(u) + \langle \lambda_k, -Au \rangle + \tfrac{\tau}{2} \|b - Au - Bv_k\|^2
    v_{k+1} = \arg\min_v G(v) + \langle \lambda_k, -Bv \rangle + \tfrac{\tau}{2} \|b - Au_{k+1} - Bv\|^2
    \lambda_{k+1} = \lambda_k + \tau (b - Au_{k+1} - Bv_{k+1})
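To make the updates concrete, here is a minimal Python sketch of the loop, written against caller-supplied proximal maps. The helper names `prox_u`/`prox_v` and their signatures are illustrative assumptions, not from the slides; completing the square in the augmented Lagrangian puts each subproblem in exactly this proximal form.

```python
import numpy as np

def admm(prox_u, prox_v, A, B, b, tau=1.0, iters=100):
    """Sketch of ADMM for min H(u) + G(v) s.t. Au + Bv = b.

    prox_u(w, tau) should return argmin_u H(u) + (tau/2)||Au - w||^2,
    prox_v(w, tau) should return argmin_v G(v) + (tau/2)||Bv - w||^2
    (hypothetical helpers supplied by the caller).
    """
    u = np.zeros(A.shape[1])
    v = np.zeros(B.shape[1])
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        u = prox_u(b - B @ v + lam / tau, tau)   # u-step with the old v
        v = prox_v(b - A @ u + lam / tau, tau)   # v-step with the fresh u
        lam = lam + tau * (b - A @ u - B @ v)    # dual ascent on lambda
    return u, v, lam
```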

Penalty parameter

How should the free penalty parameter \tau_k in ADMM be selected?

    u_{k+1} = \arg\min_u H(u) + \langle \lambda_k, -Au \rangle + \tfrac{\tau_k}{2} \|b - Au - Bv_k\|^2
    v_{k+1} = \arg\min_v G(v) + \langle \lambda_k, -Bv \rangle + \tfrac{\tau_k}{2} \|b - Au_{k+1} - Bv\|^2
    \lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_{k+1})

Background: spectral stepsize for gradient descent

Objective: \min_x F(x)
Gradient descent: x_{k+1} = x_k - \tau_k \nabla F(x_k)

If F is quadratic, F(x) = \tfrac{\alpha}{2}\|x - x^*\|^2, the optimal stepsize is \tau_k = 1/\alpha.

Spectral (Barzilai-Borwein) stepsize: \tau_k = 1/\alpha, where \alpha is the local curvature under the linear-gradient assumption \nabla F(x) = \alpha x + a, and \alpha is estimated by one-dimensional least squares on

    \nabla F(x_k) - \nabla F(x_{k-1}) = \alpha (x_k - x_{k-1})

[Figure: one gradient step of length \tau_k = 1/\alpha moving x_k to x_{k+1}]
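A hedged sketch of the Barzilai-Borwein estimate: the one-dimensional least-squares fit of gradient differences against iterate differences has the closed form below. The guard against non-positive curvature is an added assumption, not from the slides.

```python
import numpy as np

def bb_stepsize(x, x_prev, g, g_prev, fallback=1.0):
    """Spectral (Barzilai-Borwein) stepsize tau = 1/alpha, where alpha solves
    min_alpha ||(g - g_prev) - alpha*(x - x_prev)||^2 in closed form."""
    dx = x - x_prev
    dg = g - g_prev
    alpha = (dx @ dg) / (dx @ dx)   # closed-form 1-D least squares
    if alpha <= 0:                  # curvature estimate unusable; keep a default
        return fallback
    return 1.0 / alpha
```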

Background: spectral stepsize for gradient descent

- Automates the stepsize selection
- Achieves fast convergence
- Can the same idea be applied to constrained problems?

Dual interpretation of ADMM

Constrained problem:

    \min_{u,v} H(u) + G(v) \quad \text{s.t.} \quad Au + Bv = b

Dual problem without constraints:

    \min_\lambda \hat{H}(\lambda) + \hat{G}(\lambda), \quad \text{where} \quad \hat{H}(\lambda) = H^*(A^T\lambda) - \langle \lambda, b \rangle \quad \text{and} \quad \hat{G}(\lambda) = G^*(B^T\lambda)

Define \hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k). Then ADMM is equivalent to Douglas-Rachford splitting (DRS) on the dual, with (u, v, \lambda) \Leftrightarrow (\hat\lambda, \lambda).

F^* denotes the Fenchel conjugate of F, defined as F^*(y) = \sup_x \langle x, y \rangle - F(x).

Spectral stepsize of DRS

Dual problem: \min_\lambda \hat{H}(\lambda) + \hat{G}(\lambda)

Approximate \partial\hat{H} and \partial\hat{G} at iteration k as linear functions:

    \partial\hat{H}(\hat\lambda) = \alpha \hat\lambda + \Psi \quad \text{and} \quad \partial\hat{G}(\lambda) = \beta \lambda + \Phi

Proposition: when DRS is applied, the minimal residual of \hat{H}(\lambda_{k+1}) + \hat{G}(\lambda_{k+1}) is obtained by setting

    \tau_k = 1/\sqrt{\alpha \beta}.

[Figure: local curvatures \alpha of \hat{H} and \beta of \hat{G} near the current iterate, with \hat\alpha = 1/\alpha and \hat\beta = 1/\beta]

Spectral stepsize estimation

- Spectral stepsize: \tau_k = 1/\sqrt{\alpha_k \beta_k}
- Estimate the curvatures \alpha, \beta of \hat{H}, \hat{G} from the ADMM iterates (u, v, \lambda, \hat\lambda)
- One-dimensional least squares with a closed-form solution, fitting (as sketched in code below)

      A(u_k - u_{k_0}) = \alpha \cdot (\hat\lambda_k - \hat\lambda_{k_0}) \quad \text{and} \quad B(v_k - v_{k_0}) = \beta \cdot (\lambda_k - \lambda_{k_0})

Recall \hat\lambda_{k+1} = \lambda_k + \tau_k (b - Au_{k+1} - Bv_k).
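In code, the two curvature fits reduce to one-dimensional least squares on iterate differences. This is a sketch under the assumption that the iterates from an earlier step k0 are cached; the paper blends two BB-style estimates, which this sketch omits, and variable names are illustrative.

```python
import numpy as np

def spectral_tau(u, u0, v, v0, lam, lam0, lam_hat, lam_hat0, A, B):
    """Spectral penalty tau_k = 1/sqrt(alpha_k * beta_k) from 1-D least
    squares fits of dual (sub)gradient changes vs. dual variable changes."""
    d_h, d_lh = A @ (u - u0), lam_hat - lam_hat0   # H-hat side differences
    d_g, d_l = B @ (v - v0), lam - lam0            # G-hat side differences
    alpha = (d_h @ d_lh) / (d_lh @ d_lh)           # curvature of H-hat
    beta = (d_g @ d_l) / (d_l @ d_l)               # curvature of G-hat
    return 1.0 / np.sqrt(alpha * beta)
```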

Safeguarding inaccurate estimation

Objective: \min_x F(x)
Gradient descent: x_{k+1} = x_k - \tau_k \nabla F(x_k)

For gradient descent, inaccurate stepsizes are safeguarded by backtracking line search.

Safeguarding inaccurate estimation

Constrained problem:

    \min_{u,v} H(u) + G(v) \quad \text{s.t.} \quad Au + Bv = b

Lagrangian saddle-point problem:

    \max_\lambda \min_{u,v} H(u) + G(v) + \langle \lambda, b - Au - Bv \rangle

[Figure: saddle-point surface of the Lagrangian]

Safeguarding

Validate the correlations supporting the linear assumption on the (sub)gradients:

    \alpha_k^{cor} = \frac{\langle A(u_k - u_{k_0}), \hat\lambda_k - \hat\lambda_{k_0} \rangle}{\|A(u_k - u_{k_0})\| \, \|\hat\lambda_k - \hat\lambda_{k_0}\|}
    \qquad
    \beta_k^{cor} = \frac{\langle B(v_k - v_{k_0}), \lambda_k - \lambda_{k_0} \rangle}{\|B(v_k - v_{k_0})\| \, \|\lambda_k - \lambda_{k_0}\|}
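The correlation tests are plain cosine similarities between the gradient changes and the dual-variable changes. A sketch; the small epsilon in the denominator is an added numerical guard, not part of the definition above.

```python
import numpy as np

def correlation(d_grad, d_dual):
    """Cosine similarity between a (sub)gradient change and the corresponding
    dual change; values near 1 support the linear-gradient assumption."""
    denom = np.linalg.norm(d_grad) * np.linalg.norm(d_dual) + 1e-12
    return (d_grad @ d_dual) / denom

# alpha_cor = correlation(A @ (u - u0), lam_hat - lam_hat0)
# beta_cor  = correlation(B @ (v - v0), lam - lam0)
```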

Safeguarding

Safeguarded spectral penalty parameter:

    \tau_{k+1} =
      1/\sqrt{\alpha_k \beta_k}   if \alpha_k^{cor} > \epsilon^{cor} and \beta_k^{cor} > \epsilon^{cor}
      1/\alpha_k                  if \alpha_k^{cor} > \epsilon^{cor} and \beta_k^{cor} \le \epsilon^{cor}
      1/\beta_k                   if \alpha_k^{cor} \le \epsilon^{cor} and \beta_k^{cor} > \epsilon^{cor}
      \tau_k                      otherwise
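The four-way rule translates directly into code; a sketch using the same illustrative names as the earlier sketches:

```python
import numpy as np

def safeguarded_tau(tau, alpha, beta, alpha_cor, beta_cor, eps_cor=0.2):
    """Safeguarded spectral penalty: use a curvature estimate only when its
    correlation test passes; otherwise fall back to the previous tau."""
    if alpha_cor > eps_cor and beta_cor > eps_cor:
        return 1.0 / np.sqrt(alpha * beta)   # both estimates trusted
    if alpha_cor > eps_cor:
        return 1.0 / alpha                   # only the H-hat estimate trusted
    if beta_cor > eps_cor:
        return 1.0 / beta                    # only the G-hat estimate trusted
    return tau                               # keep the current penalty
```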

Convergence guarantee

Adaptive ADMM converges when one of the following conditions is satisfied [He et al., 2000, Xu et al., 2017]:

- Bounded increase:

      \sum_{k=1}^\infty \eta_k^2 < \infty, \quad \text{where} \quad \eta_k = \sqrt{\max\{\tau_k / \tau_{k-1}, 1\}} - 1

- Bounded decrease:

      \sum_{k=1}^\infty \theta_k^2 < \infty, \quad \text{where} \quad \theta_k = \sqrt{\max\{\tau_{k-1} / \tau_k, 1\}} - 1

Adaptive ADMM algorithm

- Run the ADMM steps to update (u_{k+1}, v_{k+1}, \lambda_{k+1})
- Estimate the curvatures \alpha_k, \beta_k
- Estimate the correlations \alpha_k^{cor}, \beta_k^{cor}
- Apply the safeguarded spectral penalty rule to update \tau_{k+1}
- Stop adaptivity after a fixed number of iterations to guarantee convergence

A sketch of the full loop follows below.
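Putting the steps together, here is a sketch of the full loop. It reuses the illustrative helpers from the earlier sketches (`prox_u`/`prox_v`, `correlation`, `safeguarded_tau`); the caching of iterates from the last adaptation step and the small denominators are plausible implementation choices, not prescribed by the slides.

```python
import numpy as np

def adaptive_admm(prox_u, prox_v, A, B, b, tau0=0.1, iters=500,
                  adapt_freq=2, stop_adapt=100, eps_cor=0.2):
    """Adaptive ADMM sketch: ADMM steps plus safeguarded spectral penalty
    updates, with adaptivity frozen after `stop_adapt` iterations."""
    u = np.zeros(A.shape[1]); v = np.zeros(B.shape[1])
    lam = np.zeros(b.shape[0]); tau = tau0
    u0, v0 = u.copy(), v.copy()
    lam0, lam_hat0 = lam.copy(), lam.copy()
    for k in range(iters):
        u = prox_u(b - B @ v + lam / tau, tau)
        lam_hat = lam + tau * (b - A @ u - B @ v)   # hat-dual uses the old v
        v = prox_v(b - A @ u + lam / tau, tau)
        lam = lam + tau * (b - A @ u - B @ v)
        if 0 < k < stop_adapt and k % adapt_freq == 0:
            d_h, d_lh = A @ (u - u0), lam_hat - lam_hat0
            d_g, d_l = B @ (v - v0), lam - lam0
            alpha = (d_h @ d_lh) / (d_lh @ d_lh + 1e-12)  # curvature fits
            beta = (d_g @ d_l) / (d_l @ d_l + 1e-12)
            tau = safeguarded_tau(tau, alpha, beta,
                                  correlation(d_h, d_lh),
                                  correlation(d_g, d_l), eps_cor)
            u0, v0 = u.copy(), v.copy()                   # refresh the cache
            lam0, lam_hat0 = lam.copy(), lam_hat.copy()
    return u, v, lam
```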

Experiments

Baselines:

- Vanilla ADMM [Gabay and Mercier, 1976, Glowinski and Marroco, 1975, Boyd et al., 2011]
- Nesterov acceleration ("Fast ADMM") [Goldstein et al., 2014]
- Fixed optimal penalty parameter [Raghunathan and Di Cairano, 2014]
- Residual balancing [He et al., 2000, Boyd et al., 2011], with residuals

      r_k = b - Au_k - Bv_k, \qquad d_k = \tau_k A^T B(v_k - v_{k-1})

  and update rule (\eta = 10, \mu = 2)

      \tau_{k+1} =
        \eta \tau_k    if \|r_k\|_2 > \mu \|d_k\|_2
        \tau_k / \eta  if \|d_k\|_2 > \mu \|r_k\|_2
        \tau_k         otherwise

A code sketch of the residual balancing rule follows below.
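For comparison, the residual balancing rule is only a few lines; a sketch with the slide's defaults \eta = 10, \mu = 2:

```python
import numpy as np

def residual_balance_tau(tau, r, d, eta=10.0, mu=2.0):
    """Residual balancing: grow tau when the primal residual r dominates,
    shrink it when the dual residual d dominates (He et al., 2000)."""
    if np.linalg.norm(r) > mu * np.linalg.norm(d):
        return eta * tau
    if np.linalg.norm(d) > mu * np.linalg.norm(r):
        return tau / eta
    return tau
```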

Numerical results

- Applications: elastic net regularized linear regression (EN); low-rank least squares (LRLS); dual SVM (quadratic programming); basis pursuit (BP); consensus logistic regression; semidefinite programming (SDP)
- Benchmark datasets from the UCI repository and the LIBSVM page
- Initial penalty \tau_0 = 0.1, fixed safeguarding threshold \epsilon^{cor} = 0.2
- Details and more results in the paper

Iterations to convergence (runtime in parentheses):

Application | Dataset  | Vanilla ADMM | Fast ADMM | Residual balance | Adaptive ADMM
EN          | Boston   | 2000+        | 208       | 54 (.023)        | 17 (.011)
EN          | Leukemia | 2000+        | 2000+     | 1737 (19.3)      | 152 (1.70)
LRLS        | Madelon  | 1943         | 193       | 133 (60.9)       | 27 (12.8)
Dual SVM    | Madelon  | 100          | 57        | 28 (4.12)        | 19 (2.64)
BP          | Human1   | 2000+        | 2000+     | 839 (.990)       | 503 (.626)
Consensus   | Madelon  | 2000+        | 2000+     | 115 (42.1)       | 23 (20.8)
Consensus   | Realsim  | 1000+        | 1000+     | 121 (558)        | 22 (118)
SDP         | Ham-11-2 | 2000+        | 2000+     | 1203 (4.15e3)    | 447 (1.49e3)

Residual plot

- Relative residual (computed as in the sketch below):

      \max\left\{ \frac{\|r_k\|_2}{\max\{\|Au_k\|_2, \|Bv_k\|_2, \|b\|_2\}}, \; \frac{\|d_k\|_2}{\|A^T \lambda_k\|_2} \right\}

- Low-rank least squares:

      \min_X \tfrac{1}{2}\|DX - C\|_F^2 + \rho_1 \|X\|_* + \tfrac{\rho_2}{2}\|X\|_F^2

- Methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: relative residual (log scale, 10^1 down to 10^-5) versus iteration (0 to 600) for the four methods]
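The stopping criterion plotted above can be computed as follows; a sketch of the definition, with illustrative variable names:

```python
import numpy as np

def relative_residual(A, B, b, u, v, v_prev, lam, tau):
    """Relative combined residual: the max of the normalized primal and
    dual residuals, matching the definition on this slide."""
    r = b - A @ u - B @ v                    # primal residual r_k
    d = tau * (A.T @ (B @ (v - v_prev)))     # dual residual d_k
    rel_primal = np.linalg.norm(r) / max(np.linalg.norm(A @ u),
                                         np.linalg.norm(B @ v),
                                         np.linalg.norm(b))
    rel_dual = np.linalg.norm(d) / np.linalg.norm(A.T @ lam)
    return max(rel_primal, rel_dual)
```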

Sensitivity: initial penalty

- Elastic net regularized linear regression (its proximal maps are sketched below):

      \min_{u,v} \tfrac{1}{2}\|Du - c\|_2^2 + \rho_1 \|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2 \quad \text{s.t.} \quad u - v = 0

- Methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus initial penalty parameter \tau_0 from 10^{-5} to 10^5, panels (a) elastic net regression and (b) quadratic programming]
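As a worked instance of the earlier `admm` sketch, the elastic net splitting above fits the template with A = I, B = -I, b = 0. The proximal maps below are standard closed forms; the factory function and its name are illustrative.

```python
import numpy as np

def elastic_net_proxes(D, c, rho1, rho2):
    """Proximal maps for min (1/2)||Du - c||^2 + rho1*||v||_1
    + (rho2/2)||v||^2 s.t. u - v = 0, i.e., A = I, B = -I, b = 0."""
    DtD, Dtc, n = D.T @ D, D.T @ c, D.shape[1]
    def prox_u(w, tau):
        # argmin_u (1/2)||Du - c||^2 + (tau/2)||u - w||^2: a linear solve
        return np.linalg.solve(DtD + tau * np.eye(n), Dtc + tau * w)
    def prox_v(w, tau):
        # With B = -I the v-step is argmin_v G(v) + (tau/2)||v + w||^2:
        # soft-threshold z = -w, then shrink for the L2 term
        z = -w
        return np.sign(z) * np.maximum(np.abs(z) - rho1 / tau, 0) / (1 + rho2 / tau)
    return prox_u, prox_v
```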

Sensitivity: problem scale

- Quadratic programming with the objective scaled by s:

      \min_{u,v} \tfrac{s}{2} u^T Q u + s\, q^T u + \iota_{\{z :\, z_i \le c\}}(v) \quad \text{s.t.} \quad Du - v = 0

- Methods: Vanilla ADMM, Fast ADMM, Residual balance, Adaptive ADMM

[Figure: iterations to convergence versus problem scale s from 10^{-5} to 10^5, panels (b) quadratic programming and (c) low-rank least squares]

Sensitivity: safeguarding threshold

- \epsilon^{cor} = 0.2 works well across applications

[Figure: convergence iterations (log scale) versus safeguarding correlation threshold from 0 to 1, for EN LinReg, Cons LogReg, Quad Prog, Basis Pursuit, LRLS, SDP]

Conclusion and extensions

- Spectral penalty parameter selection for constrained problems
- ADMM is equivalent to DRS on the unconstrained dual problem
- Combine the estimated curvatures of the two dual functions
- Effective safeguarding
- Fully automated, with fast convergence

Extensions:

- Relaxed ADMM [Xu et al., 2017]
- Nonconvex applications [Xu et al., 2016b]
- Multi-block ADMM [Xu et al., under review]
- Large-scale distributed computing [Xu et al., under review]
- O(1/k) convergence rate [Xu et al., under review]

Q&A

Thank you!

References

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122, 2011.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17-40, 1976.

R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. ESAIM: Modélisation Mathématique et Analyse Numérique, 9:41-76, 1975.

T. Goldstein, B. O'Donoghue, S. Setzer, and R. Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588-1623, 2014.

B. He, H. Yang, and S. Wang. Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and Applications, 106(2):337-356, 2000.

A. Raghunathan and S. Di Cairano. Alternating direction method of multipliers for strictly convex quadratic programs: Optimal parameter selection. In American Control Conference, pages 4324-4329, 2014.

G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein. Training neural networks without gradients: A scalable ADMM approach. ICML, 2016.

Z. Xu, S. De, M. A. T. Figueiredo, C. Studer, and T. Goldstein. An empirical study of ADMM for nonconvex problems. In NIPS Workshop on Nonconvex Optimization, 2016a.

Z. Xu, F. Huang, L. Raschid, and T. Goldstein. Non-negative factorization of the occurrence tensor from financial contracts. In NIPS Workshop on Tensor Methods, 2016b.

Z. Xu, M. A. T. Figueiredo, X. Yuan, C. Studer, and T. Goldstein. Adaptive relaxed ADMM: Convergence theory and practical implementation. CVPR, 2017.
