Adaptive Consensus ADMM for Distributed Optimization
Zheng Xu, with Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein

Outline
• Consensus problem in distributed computing
• Alternating direction method of multipliers (ADMM) and the penalty parameter
• Adaptive consensus ADMM (ACADMM) with spectral stepsize: a fully automated optimizer
• The O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Statistical learning problem

\min_v \; f(v) + g(v)

• Example:

\min_x \; \frac{1}{2}\|Dx - c\|_2^2 + \rho_1 \|x\|_1 + \frac{\rho_2}{2}\|x\|_2^2

[Figure: the linear system Dx = c, with data matrix D, unknown x, and observations c.]
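A minimal NumPy sketch (not part of the original slides) that evaluates this elastic-net objective; D, c, rho1, rho2 follow the slide's notation, and the random data are placeholders.

```python
import numpy as np

def elastic_net_objective(x, D, c, rho1, rho2):
    """Evaluate 0.5*||Dx - c||_2^2 + rho1*||x||_1 + 0.5*rho2*||x||_2^2."""
    residual = D @ x - c
    return 0.5 * residual @ residual + rho1 * np.abs(x).sum() + 0.5 * rho2 * x @ x

# Placeholder data just to exercise the function.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
x = rng.standard_normal(10)
print(elastic_net_objective(x, D, c, rho1=0.1, rho2=0.1))
```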

Problem decomposition and data parallelism

\min_v \; \sum_{i=1}^{N} f_i(v) + g(v)

• Example:

\min_x \; \sum_{i=1}^{N} \frac{1}{2}\|D_i x - c_i\|_2^2 + \rho_1 \|x\|_1 + \frac{\rho_2}{2}\|x\|_2^2

where the data are split across nodes as D = [D_1; \ldots; D_i; \ldots; D_N] and c = [c_1; \ldots; c_i; \ldots; c_N].

[Figure: row-wise partition of D and c across the N nodes.]
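A small sketch of the row-wise data partition implied by this slide, assuming the rows of D (and entries of c) are simply split into N contiguous blocks; the splitting rule and block sizes are illustrative, not prescribed by the talk.

```python
import numpy as np

def partition_rows(D, c, N):
    """Split (D, c) row-wise into N blocks: D = [D_1; ...; D_N], c = [c_1; ...; c_N]."""
    D_blocks = np.array_split(D, N, axis=0)
    c_blocks = np.array_split(c, N)
    return list(zip(D_blocks, c_blocks))

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
blocks = partition_rows(D, c, N=4)   # each node i holds its own (D_i, c_i)
print([Di.shape for Di, ci in blocks])
```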

Consensus problem

\min_{u_i, v} \; \sum_{i=1}^{N} f_i(u_i) + g(v), \quad \text{subject to } u_i = v.

• Example:

f_i(u_i) = \frac{1}{2}\|D_i u_i - c_i\|_2^2 on the local nodes, and g(v) = \rho_1 \|v\|_1 + \frac{\rho_2}{2}\|v\|_2^2 on the central server, which holds the consensus variable v.

[Figure: local nodes i = 1, ..., N each hold f_i(u_i); a central server holds v and g(v).]

Consensus ADMM

\min_{u_i, v} \; \sum_{i=1}^{N} f_i(u_i) + g(v), \quad \text{subject to } u_i = v.

u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) + \langle \lambda_i^k, v^k - u_i \rangle + \frac{\tau_i^k}{2}\|v^k - u_i\|^2

v^{k+1} = \arg\min_v \; g(v) + \sum_{i=1}^{N} \left( \langle \lambda_i^k, v - u_i^{k+1} \rangle + \frac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \right)

\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^{k+1} - u_i^{k+1})

Consensus ADMM and penalty parameter

\min_{u_i, v} \; \sum_{i=1}^{N} f_i(u_i) + g(v), \quad \text{subject to } u_i = v.

u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) + \langle \lambda_i^k, v^k - u_i \rangle + \frac{\tau_i^k}{2}\|v^k - u_i\|^2

v^{k+1} = \arg\min_v \; g(v) + \sum_{i=1}^{N} \left( \langle \lambda_i^k, v - u_i^{k+1} \rangle + \frac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \right)

\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^{k+1} - u_i^{k+1})

The penalty parameter τ_i^k is the only free parameter!
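The following single-process sketch walks through these consensus ADMM iterations for the elastic-net example: each local u_i-update is a small regularized least-squares solve, and the server v-update is a soft-thresholding prox step. It illustrates the updates on the slide under fixed penalties tau_i; it is not the authors' distributed implementation, and the data sizes, iteration count, and fixed tau are placeholders.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def consensus_admm(D_blocks, c_blocks, rho1, rho2, tau=1.0, iters=100):
    """Consensus ADMM for sum_i 0.5||D_i u_i - c_i||^2 + rho1||v||_1 + 0.5*rho2||v||^2, u_i = v."""
    N = len(D_blocks)
    d = D_blocks[0].shape[1]
    v = np.zeros(d)
    u = [np.zeros(d) for _ in range(N)]
    lam = [np.zeros(d) for _ in range(N)]
    taus = [tau] * N                      # one penalty per node (fixed here; ACADMM adapts these)
    for _ in range(iters):
        # Local u_i-updates: (D_i^T D_i + tau_i I) u_i = D_i^T c_i + lambda_i + tau_i v
        for i in range(N):
            A = D_blocks[i].T @ D_blocks[i] + taus[i] * np.eye(d)
            b = D_blocks[i].T @ c_blocks[i] + lam[i] + taus[i] * v
            u[i] = np.linalg.solve(A, b)
        # Server v-update: prox of rho1||.||_1 + 0.5*rho2||.||^2 around an aggregated point
        A_sum = rho2 + sum(taus)
        w = sum(taus[i] * u[i] - lam[i] for i in range(N))
        v = soft_threshold(w / A_sum, rho1 / A_sum)
        # Dual updates
        for i in range(N):
            lam[i] = lam[i] + taus[i] * (v - u[i])
    return v

# Tiny synthetic example (placeholder data).
rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((5, 8)) for _ in range(4)]
c_blocks = [rng.standard_normal(5) for _ in range(4)]
print(consensus_admm(D_blocks, c_blocks, rho1=0.1, rho2=0.1))
```

In the distributed setting the u_i- and lambda_i-updates run on the local nodes, and only the aggregated sum needed for the v-update is communicated to the central server, which is what makes the consensus formulation attractive for data parallelism.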

Background: gradient descent

• Objective: \min_x F(x)
• Gradient descent: x^{k+1} = x^k - \tau^k \nabla F(x^k)

[Figure: a gradient step of length τ^k from x^k.]

Background: quadratic case

• Objective: \min_x F(x)
• Gradient descent: x^{k+1} = x^k - \tau^k \nabla F(x^k)
• If F is quadratic, F(x) = \frac{\alpha}{2}\|x - x^*\|^2,
• then the optimal stepsize is \tau^k = 1/\alpha (a single step lands exactly on the minimizer x^*).

[Figure: with τ^k = 1/α, the step from x^k reaches x^{k+1} = x^*.]

Background: spectral stepsize

• Objective: \min_x F(x)
• Gradient descent: x^{k+1} = x^k - \tau^k \nabla F(x^k)
• Spectral (Barzilai-Borwein) stepsize:
  • Assume the function is locally quadratic with curvature α
  • Estimate the curvature by solving a 1-d least squares problem for the model \nabla F(x) = \alpha x + a
  • Take the gradient step with \tau^k = 1/\alpha

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
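A minimal gradient-descent sketch with the Barzilai-Borwein (spectral) stepsize: the curvature alpha is fit by 1-d least squares to the secant pair from the last two iterates, and the step is tau = 1/alpha. The quadratic test problem and the positivity fallback are illustrative assumptions.

```python
import numpy as np

def bb_gradient_descent(grad, x0, iters=50, tau0=1e-3, tol=1e-10):
    """Gradient descent with the Barzilai-Borwein (spectral) stepsize tau = 1/alpha_hat."""
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - tau0 * g_prev            # first step uses a small fixed stepsize
    for _ in range(iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        s, y = x - x_prev, g - g_prev     # secant pair from the last two iterates
        alpha_hat = (s @ y) / (s @ s)     # 1-d least-squares curvature: y ~ alpha * s
        tau = 1.0 / alpha_hat if alpha_hat > 0 else tau0   # crude positivity safeguard
        x_prev, g_prev = x, g
        x = x - tau * g
    return x

# Quadratic test problem F(x) = 0.5*x^T A x - b^T x (placeholder data).
rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M.T @ M + np.eye(10)
b = rng.standard_normal(10)
x_star = bb_gradient_descent(lambda x: A @ x - b, np.zeros(10))
print(np.linalg.norm(A @ x_star - b))     # gradient norm, should be near zero
```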

Advantages of spectral stepsize

• Automates the stepsize selection
• Achieves fast (superlinear) convergence
• What about ADMM?

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
Y. Dai. A new analysis on the Barzilai-Borwein gradient method. 2013.

Spectral penalty of ADMM

Consensus problem:

\min_{u_i, v} \; \sum_{i=1}^{N} f_i(u_i) + g(v), \quad \text{subject to } u_i = v.

• Assume the function is locally quadratic
• Estimate the curvature(s)
• Decide the penalty parameter

Dual interpretation

• Consensus problem:

\min_{u_i, v} \; \sum_{i=1}^{N} f_i(u_i) + g(v), \quad \text{subject to } u_i = v.

• Equivalently, in ADMM form:

\min_{u, v} \; f(u) + g(v), \quad \text{subject to } u + Bv = 0,

where B = -(I_d; \ldots; I_d) and u = (u_1; \ldots; u_N).

• Dual problem by Fenchel conjugates:

\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^T \lambda)}_{\hat g(\lambda)}

No constraints!

Dual problem and DRS

\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^T \lambda)}_{\hat g(\lambda)}

• ADMM on (u, v, λ) is equivalent to Douglas-Rachford splitting (DRS) on (λ̂, λ) applied to \hat f(\hat\lambda) + \hat g(\lambda),
• where \hat\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^k - u_i^{k+1}).

Linear approximation

• The gradients are linear:

\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + \psi, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + \phi

[Figure: linear models of the dual gradients ∂f̂(λ̂) and ∂ĝ(λ).]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation

• The gradients are linear, now for the stacked dual variables \hat\lambda = [\hat\lambda_i]_{i \le N} and \lambda = [\lambda_i]_{i \le N}:

\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + \psi, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + \phi

[Figure: the same linear models, illustrated blockwise over the N nodes.]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation

• The gradients are linear
• Node-specific penalty parameter: each block i gets its own curvatures α_i and β_i:

\partial \hat f_i(\hat\lambda_i) = \alpha_i \cdot \hat\lambda_i + \psi_i, \qquad \partial \hat g_i(\lambda_i) = \beta_i \cdot \lambda_i + \phi_i

[Figure: blockwise linear models with per-node curvatures α_1, ..., α_N and β_1, ..., β_N.]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. 2016.

Linear approximation

• The gradients are linear:

\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + \psi \quad \text{and} \quad \partial \hat g(\lambda) = M_\beta \lambda + \phi,

where M_α and M_β are diagonal matrices.
• Node-specific penalty parameter: the diagonal entries of M_α and M_β give the per-node curvatures α_i and β_i.
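As a sketch of what "estimate the curvatures" can look like in code, the per-node slopes alpha_i and beta_i below are fit by 1-d least squares to differences of the dual variables and the corresponding gradient changes. The function signature and the plain least-squares formula are illustrative assumptions; ACADMM derives these differences from the ADMM variables themselves and layers safeguards on top (next slide).

```python
import numpy as np

def curvature_estimates(d_lambda_hat, d_grad_f, d_lambda, d_grad_g):
    """Per-node 1-d least-squares curvature fits (alpha_i, beta_i) from iterate differences.

    d_lambda_hat[i], d_grad_f[i]: change in lambda_hat_i and in the gradient of f_hat_i,
    d_lambda[i],     d_grad_g[i]: change in lambda_i and in the gradient of g_hat_i.
    """
    alphas, betas = [], []
    for dlh, dgf, dl, dgg in zip(d_lambda_hat, d_grad_f, d_lambda, d_grad_g):
        alphas.append((dlh @ dgf) / (dlh @ dlh))   # fit d_grad_f ~ alpha_i * d_lambda_hat
        betas.append((dl @ dgg) / (dl @ dl))       # fit d_grad_g ~ beta_i  * d_lambda
    return np.array(alphas), np.array(betas)

# Placeholder differences for 3 nodes, just to show the call.
rng = np.random.default_rng(0)
diffs = [[rng.standard_normal(5) for _ in range(3)] for _ in range(4)]
print(curvature_estimates(*diffs))
```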

Node-specific spectral penalty

• Schema:

\tau_i^k = \frac{1}{\sqrt{\alpha_i \beta_i}}, \qquad \forall i = 1, \ldots, N

• Estimation and safeguarding: from the ADMM variables (u, v, λ)

[Figure: the curvature α_i^k is fit from the changes λ̂_i^k - λ̂_i^{k_0} and u_i^k - u_i^{k_0} between iterations k and k_0, with correlation α_{cor,i}^k; similarly β_i^k is fit from λ_i^k - λ_i^{k_0} and v^k - v^{k_0}, with correlation β_{cor,i}^k.]
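A sketch of the penalty rule on this slide for a single node: tau_i = 1/sqrt(alpha_i * beta_i), accepted only when the correlation between the fitted linear model and the observed differences is high enough, otherwise the previous tau_i is kept. The correlation threshold and the fallback behaviour are assumptions modeled loosely on the ACADMM safeguarding, not a verbatim reproduction of it.

```python
import numpy as np

def spectral_penalty(tau_prev, d_lambda_hat, d_grad_f, d_lambda, d_grad_g, corr_tol=0.2):
    """Safeguarded spectral penalty tau_i = 1/sqrt(alpha_i * beta_i) for a single node."""
    def fit(dx, dy):
        slope = (dx @ dy) / (dx @ dx)                                   # 1-d least-squares slope
        corr = (dx @ dy) / (np.linalg.norm(dx) * np.linalg.norm(dy) + 1e-12)
        return slope, corr
    alpha, a_corr = fit(d_lambda_hat, d_grad_f)                         # curvature of f_hat_i
    beta, b_corr = fit(d_lambda, d_grad_g)                              # curvature of g_hat_i
    if a_corr > corr_tol and b_corr > corr_tol and alpha > 0 and beta > 0:
        return 1.0 / np.sqrt(alpha * beta)                              # trust both curvature fits
    return tau_prev                                                     # safeguard: keep previous penalty
```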

O(1/k) convergence with adaptivity

• Bounded adaptivity:

\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where } (\eta^k)^2 = \max_{i \in \{1, \ldots, p\}} (\eta_i^k)^2, \quad (\eta_i^k)^2 = \max\{\tau_i^k / \tau_i^{k-1} - 1, \; \tau_i^{k-1} / \tau_i^k - 1\}.

• The norm of the residuals converges to zero
• The worst-case ergodic O(1/k) convergence rate holds in the variational inequality sense
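A small sketch of the bounded-adaptivity quantity (eta^k)^2 from this slide; it can be used to monitor the condition, and the budget-based cutoff shown here is one illustrative way to enforce it, not the mechanism used in the paper's analysis.

```python
import numpy as np

def eta_squared(tau_new, tau_old):
    """(eta^k)^2 from the slide: worst relative change of any node's penalty parameter."""
    per_node = np.maximum(tau_new / tau_old, tau_old / tau_new) - 1.0   # (eta_i^k)^2
    return float(np.max(per_node))

# One simple (illustrative) way to keep sum_k (eta^k)^2 finite: stop adapting
# once a fixed adaptivity budget has been spent.
budget, spent = 10.0, 0.0
tau_old = np.ones(4)
tau_proposed = np.array([1.5, 0.8, 1.0, 2.0])
e2 = eta_squared(tau_proposed, tau_old)
if spent + e2 <= budget:
    spent += e2
    tau_old = tau_proposed        # accept the adaptive update
# otherwise keep tau_old fixed for the remaining iterations
```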

Experiments

• ADMM methods:
  • Consensus ADMM (CADMM) [Boyd et al. 2011]
  • Residual balancing (RB-ADMM) [He et al. 2000]
  • Consensus residual balancing (CRB-ADMM) [Song et al. 2016]
  • Adaptive ADMM (AADMM) [Xu et al. 2017]
• Applications:
  • Linear regression with elastic net regularizer
  • Sparse logistic regression
  • Support vector machine
  • Semidefinite programming

Residual plot

• Application: sparse logistic regression
• Dataset: News20, size 19,996 × 1,355,191
• Distributed on 128 cores

[Figure: (left) relative residual vs. iterations for CADMM, AADMM, and ACADMM; (right) penalty parameter τ vs. iterations for ACADMM.]

More numerical results

• More results in the paper! Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+.

| Application   | Dataset   | CADMM         | RB-ADMM       | AADMM         | CRB-ADMM      | ACADMM      |
|---------------|-----------|---------------|---------------|---------------|---------------|-------------|
| EN regression | MNIST     | 100+(1.49e4)  | 88(1.29e3)    | 40(5.99e3)    | 87(1.27e4)    | 14(2.18e3)  |
| EN regression | News20    | 100+(4.61e3)  | 100+(4.60e3)  | 100+(5.17e3)  | 100+(4.60e3)  | 78(3.54e3)  |
| Sparse logreg | MNIST     | 325(444)      | 212(387)      | 325(516)      | 203(286)      | 149(218)    |
| Sparse logreg | News20    | 316(4.96e3)   | 211(3.84e3)   | 316(6.36e3)   | 207(3.73e3)   | 137(2.71e3) |
| SVM           | MNIST     | 1000+(930)    | 172(287)      | 73(127)       | 285(340)      | 41(88.0)    |
| SVM           | News20    | 259(2.63e3)   | 262(2.74e3)   | 259(3.83e3)   | 267(2.78e3)   | 217(2.37e3) |
| SDP           | Ham-9-5-6 | 100+(2.01e3)  | 100+(2.14e3)  | 35(860)       | 100+(2.14e3)  | 30(703)     |

Robust to initial penalty selection

• More sensitivity analysis in the paper!

[Figure: ENRegression-Synthetic2, iterations vs. initial penalty parameter (from 10^{-2} to 10^{4}) for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM.]

Acceleration by distribution

[Figure: SVM-Synthetic2, iterations (left) and runtime in seconds (right) vs. number of cores for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM.]

Summary

• Fully automated optimizer for the consensus problem in distributed computing
• Node-specific spectral penalty for ADMM
• O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Thank you! Poster #28 tonight
Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, Tom Goldstein

