Adaptive Consensus ADMM for Distributed Optimization
Zheng Xu, with Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein

Outline
• Consensus problem in distributed computing
• Alternating direction method of multipliers (ADMM) and the penalty parameter
• Adaptive consensus ADMM (ACADMM) with spectral stepsize: a fully automated optimizer
• The $O(1/k)$ convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Statistical learning problem

$$\min_v \; f(v) + g(v)$$

• Example:

$$\min_x \; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

[Figure: the least-squares data model $c = Dx$, shown as matrix blocks labeled $c$, $D$, and $x$.]
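
As a concrete sketch, the elastic net objective above can be evaluated in a few lines of NumPy (the data `D`, `c` and the weights `rho1`, `rho2` below are made-up placeholders, not the talk's data):

```python
import numpy as np

def elastic_net_objective(x, D, c, rho1, rho2):
    """(1/2)||Dx - c||_2^2 + rho1*||x||_1 + (rho2/2)*||x||_2^2."""
    residual = D @ x - c
    return (0.5 * residual @ residual
            + rho1 * np.abs(x).sum()
            + 0.5 * rho2 * x @ x)

# Toy data (placeholders).
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
x = rng.standard_normal(10)
print(elastic_net_objective(x, D, c, rho1=0.1, rho2=0.1))
```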

Problem decomposition and data parallelism

$$\min_v \; \sum_{i=1}^N f_i(v) + g(v)$$

• Example:

$$\min_x \; \sum_{i=1}^N \tfrac{1}{2}\|D_i x - c_i\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

[Figure: the same system split by rows across $N$ nodes, with $c = [c_1; \dots; c_i; \dots; c_N]$ and $D = [D_1; \dots; D_i; \dots; D_N]$, so that $c = Dx$.]
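
The decomposition is just a row-block split of the least-squares term; a small NumPy sketch checking that the local losses sum to the global loss (synthetic placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                    # number of nodes (illustrative)
D = rng.standard_normal((20, 10))
c = rng.standard_normal(20)
x = rng.standard_normal(10)

# Partition rows: D = [D_1; ...; D_N], c = [c_1; ...; c_N].
D_blocks = np.array_split(D, N, axis=0)
c_blocks = np.array_split(c, N)

global_loss = 0.5 * np.sum((D @ x - c) ** 2)
local_losses = [0.5 * np.sum((Di @ x - ci) ** 2)
                for Di, ci in zip(D_blocks, c_blocks)]
assert np.isclose(global_loss, sum(local_losses))  # f(x) = sum_i f_i(x)
```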

Consensus problem

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Example:

[Figure: local nodes hold $f_1(u_1), \dots, f_i(u_i), \dots, f_N(u_N)$ with $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|_2^2$; a central server holds $v$ and $g(v) = \rho_1\|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2$.]

Consensus ADMM

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

$$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) - \langle \lambda_i^k, u_i \rangle + \tfrac{\tau_i^k}{2}\,\|v^k - u_i\|^2$$

$$v^{k+1} = \arg\min_{v} \; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k, v \rangle + \tfrac{\tau_i^k}{2}\,\|v - u_i^{k+1}\|^2 \Big)$$

$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$

Consensus ADMM and the penalty parameter

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

$$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) - \langle \lambda_i^k, u_i \rangle + \tfrac{\tau_i^k}{2}\,\|v^k - u_i\|^2$$

$$v^{k+1} = \arg\min_{v} \; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k, v \rangle + \tfrac{\tau_i^k}{2}\,\|v - u_i^{k+1}\|^2 \Big)$$

$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$

The penalty parameter $\tau_i^k$ is the only free parameter! (A sketch of the iteration follows.)
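
A minimal serial NumPy sketch of these three updates for the elastic net example (fixed penalties $\tau_i$, a ridge solve per node, soft-thresholding at the server; the data is a synthetic placeholder):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def consensus_admm(D_blocks, c_blocks, rho1, rho2, tau=1.0, iters=100):
    N, d = len(D_blocks), D_blocks[0].shape[1]
    u = np.zeros((N, d))
    lam = np.zeros((N, d))
    v = np.zeros(d)
    taus = np.full(N, tau)                 # fixed penalties (non-adaptive)
    for _ in range(iters):
        for i, (Di, ci) in enumerate(zip(D_blocks, c_blocks)):
            # u_i update: (D_i^T D_i + tau_i I) u_i = D_i^T c_i + lam_i + tau_i v
            A = Di.T @ Di + taus[i] * np.eye(d)
            u[i] = np.linalg.solve(A, Di.T @ ci + lam[i] + taus[i] * v)
        # v update: prox of rho1*||.||_1 + (rho2/2)*||.||^2
        s = rho2 + taus.sum()
        z = (taus[:, None] * u - lam).sum(axis=0)
        v = soft_threshold(z / s, rho1 / s)
        # dual update
        lam += taus[:, None] * (v - u)
    return v

rng = np.random.default_rng(0)
D_blocks = [rng.standard_normal((5, 10)) for _ in range(4)]
c_blocks = [rng.standard_normal(5) for _ in range(4)]
print(consensus_admm(D_blocks, c_blocks, rho1=0.1, rho2=0.1)[:3])
```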

Background: gradient descent
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$

Background: quadratic case
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• If $F$ is quadratic, $F(x) = \tfrac{\alpha}{2}\|x - x^*\|^2$
• Then the optimal stepsize is $\tau^k = 1/\alpha$, which reaches the minimizer $x^*$ in a single step

Background: spectral stepsize
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• Spectral (Barzilai-Borwein) stepsize:
  • Assume the function is locally quadratic with curvature $\alpha$
  • Estimate the curvature by a 1-D least squares fit $\nabla F(x) \approx \alpha x + a$
  • Run gradient descent with $\tau^k = 1/\alpha$ (sketched below)

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
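
A compact sketch of gradient descent with the Barzilai-Borwein stepsize on a toy quadratic; the 1-D least squares fit of the curvature reduces to a ratio of inner products:

```python
import numpy as np

def bb_gradient_descent(grad, x0, tau0=1e-3, iters=50):
    x = x0.copy()
    g = grad(x)
    tau = tau0                              # initial stepsize before any history
    for _ in range(iters):
        x_new = x - tau * g
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        denom = dx @ dx
        if denom == 0:                      # converged exactly
            return x_new
        # 1-D least squares fit grad(x) ~ alpha*x + a  =>  alpha = <dx,dg>/<dx,dx>
        alpha = (dx @ dg) / denom
        tau = 1.0 / alpha                   # spectral (BB) stepsize tau = 1/alpha
        x, g = x_new, g_new
    return x

# Toy quadratic F(x) = 0.5 x^T A x - b^T x, with grad F(x) = A x - b.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
A = A.T @ A + np.eye(10)
b = rng.standard_normal(10)
x = bb_gradient_descent(lambda x: A @ x - b, np.zeros(10))
print(np.linalg.norm(A @ x - b))            # residual should be small
```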

Advantages of spectral stepsize
• Automates stepsize selection
• Achieves fast (superlinear) convergence

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
Y. Dai. A new analysis on the Barzilai-Borwein gradient method. 2013.

Spectral penalty of ADMM
Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Assume the function is locally quadratic
• Estimate the curvature(s)
• Use the estimated curvature(s) to set the penalty parameter

Dual interpretation
• Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v,$$

rewritten in matrix form as

$$\min_{u, v} \; f(u) + g(v), \quad \text{subject to } u + Bv = 0,$$

where $B = -(I_d; \dots; I_d)$ and $u = (u_1; \dots; u_N)$.
• Dual problem by Fenchel conjugates:

$$\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

No constraints! (Here $b = 0$ for the consensus constraint.)
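
For the missing step, the dual objective follows from one line of standard Lagrangian duality, using only the definition of the Fenchel conjugate:

$$\max_{\lambda}\,\min_{u,v}\; f(u) + g(v) + \langle \lambda,\, b - u - Bv\rangle \;=\; \max_{\lambda}\Big[\,\langle \lambda, b\rangle - \underbrace{\sup_{u}\big(\langle \lambda, u\rangle - f(u)\big)}_{f^*(\lambda)} - \underbrace{\sup_{v}\big(\langle B^\top\lambda,\, v\rangle - g(v)\big)}_{g^*(B^\top\lambda)}\Big]$$

Negating the bracketed term turns the maximization into $\min_\lambda f^*(\lambda) - \langle\lambda, b\rangle + g^*(B^\top\lambda)$, which is exactly $\hat f(\lambda) + \hat g(\lambda)$.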

Dual problem and DRS

$$\min_{\lambda} \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

• ADMM on the primal variables $(u, v, \lambda)$ is equivalent to Douglas-Rachford splitting (DRS) on the dual pair $(\hat\lambda, \lambda)$ for $\hat f(\hat\lambda) + \hat g(\lambda)$, where

$$\hat\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^k - u_i^{k+1}).$$

Linear approximation
• The gradients are (locally) linear:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation
• The gradients are (locally) linear, with the stacked dual variables $\hat\lambda = [\hat\lambda_i]_{i \le N}$ and $\lambda = [\lambda_i]_{i \le N}$:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation
• The gradients are (locally) linear
• Node-specific penalty parameter: each node $i$ has its own curvatures $\alpha_i$ and $\beta_i$:

$$\partial \hat f(\hat\lambda) = [\alpha_i \cdot \hat\lambda_i + a_i]_{i \le N}, \qquad \partial \hat g(\lambda) = [\beta_i \cdot \lambda_i + b_i]_{i \le N}$$

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. 2016.

Linear approximation
• The gradients are (locally) linear:

$$\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + a \quad \text{and} \quad \partial \hat g(\lambda) = M_\beta \lambda + b,$$

where $M_\alpha$ and $M_\beta$ are diagonal matrices.
• The node-specific penalty parameters come from the diagonal curvature entries $\alpha_i$ and $\beta_i$ of $M_\alpha$ and $M_\beta$.
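
With $M_\alpha$ and $M_\beta$ diagonal, each curvature is a 1-D least squares fit, which could be estimated from two iterates of the dual variables; a sketch (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def curvature_estimate(d_lam, d_grad):
    """Fit d_grad ~ alpha * d_lam in the least squares sense.

    d_lam  : change in a dual variable between two iterates
    d_grad : corresponding change in the (sub)gradient of the dual function
    Returns the scalar alpha minimizing ||d_grad - alpha * d_lam||^2.
    """
    return (d_lam @ d_grad) / (d_lam @ d_lam)

def correlation(d_lam, d_grad):
    """Cosine similarity of the two changes; measures how well the
    local-linearity assumption holds, and is used to safeguard updates."""
    return (d_lam @ d_grad) / (np.linalg.norm(d_lam) * np.linalg.norm(d_grad))
```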

Node-specific spectral penalty
• Schema:

$$\tau_i^k = \frac{1}{\sqrt{\alpha_i \beta_i}}, \qquad \forall i = 1, \dots, N$$

• Estimation and safeguarding: the curvatures are estimated from the ADMM variables $(u, v, \lambda)$ via the changes

$$\Delta u_i^k = u_i^k - u_i^{k_0}, \quad \Delta\hat\lambda_i^k = \hat\lambda_i^k - \hat\lambda_i^{k_0}, \quad \Delta v^k = v^k - v^{k_0}, \quad \Delta\lambda_i^k = \lambda_i^k - \lambda_i^{k_0},$$

with the correlations $\alpha_{\mathrm{cor},i}^k$ and $\beta_{\mathrm{cor},i}^k$ used as safeguards (see the sketch below).
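
A hedged sketch of the resulting per-node rule: fit both curvatures, and accept the new penalty only when both correlation safeguards pass (the threshold 0.2 and the exact gradient-change proxies are assumptions of this sketch, not taken from the talk):

```python
import numpy as np

def spectral_penalty(tau_old, d_lam_hat, d_grad_f, d_lam, d_grad_g,
                     eps_cor=0.2):
    """Safeguarded node-specific spectral penalty: tau_i = 1/sqrt(alpha_i*beta_i).

    d_lam_hat, d_grad_f : changes in lam_hat_i and in the (sub)gradient of
                          f_hat there (proportional to u_i^k - u_i^{k0})
    d_lam, d_grad_g     : changes in lam_i and in the (sub)gradient of g_hat
                          there (proportional to v^k - v^{k0})
    eps_cor             : assumed correlation threshold for this sketch
    """
    def slope(dx, dg):                 # 1-D least squares fit  dg ~ alpha * dx
        return (dx @ dg) / (dx @ dx)

    def corr(dx, dg):                  # cosine of the angle between the changes
        return (dx @ dg) / (np.linalg.norm(dx) * np.linalg.norm(dg) + 1e-12)

    alpha = slope(d_lam_hat, d_grad_f)   # curvature of f_hat at node i
    beta = slope(d_lam, d_grad_g)        # curvature of g_hat seen by node i
    if corr(d_lam_hat, d_grad_f) > eps_cor and corr(d_lam, d_grad_g) > eps_cor:
        return 1.0 / np.sqrt(alpha * beta)
    return tau_old                       # fall back when the linear fit is poor
```

In the full method this would run periodically, using the stored iterates from steps $k$ and $k_0$.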

Convergence guarantee
• Assume the relative changes of the penalty are bounded (one simple way to enforce this is sketched below):

$$\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where } (\eta^k)^2 = \max_{i \in \{1,\dots,N\}} (\eta_i^k)^2, \qquad (\eta_i^k)^2 = \max\!\left\{ \tau_i^k/\tau_i^{k-1} - 1, \;\; \tau_i^{k-1}/\tau_i^k - 1 \right\}.$$

• Then the norm of the residuals converges to zero
• And the worst-case ergodic $O(1/k)$ convergence rate holds in the variational inequality sense
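
One simple way to meet this summability condition in code is to clip each relative change of $\tau_i$ by a schedule whose squares are summable, e.g. $(\eta^k)^2 \le (C/k)^2$; the constant $C$ and the $1/k$ schedule below are choices of this sketch, not prescribed by the talk:

```python
def bounded_penalty_update(tau_old, tau_proposed, k, C=100.0):
    """Clip tau so that max{tau/tau_old, tau_old/tau} - 1 <= (C/k)^2.

    Since sum_k (C/k)^2 is finite, the clipped sequence satisfies the
    bounded-adaptivity condition sum_k (eta^k)^2 < inf required above.
    """
    bound = 1.0 + (C / k) ** 2
    lo, hi = tau_old / bound, tau_old * bound
    return min(max(tau_proposed, lo), hi)
```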

Experiments
• ADMM methods:
  • Consensus ADMM [Boyd et al. 2011]
  • Residual balancing [He et al. 2000]
  • Consensus residual balancing [Song et al. 2016]
  • Adaptive ADMM [Xu et al. 2017]
• Applications:
  • Linear regression with elastic net regularizer
  • Sparse logistic regression
  • Support vector machine
  • Semidefinite programming

Residual plot
• Application: sparse logistic regression
• Dataset: News20, size 19,996 × 1,355,191
• Distributed on 128 cores

[Figure: relative residual (log scale, $10^{-5}$ to $10^1$) and penalty $\tau$ versus iterations (0-350).]

More numerical results
• More results in the paper!

Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+. Columns follow the method order listed on the Experiments slide, ending with the proposed ACADMM.

| Application | Dataset | CADMM | RB | CRB | AADMM | ACADMM |
|---|---|---|---|---|---|---|
| EN regression | MNIST | 100+ (1.49e4) | 88 (1.29e3) | 40 (5.99e3) | 87 (1.27e4) | 14 (2.18e3) |
| EN regression | News20 | 100+ (4.61e3) | 100+ (4.60e3) | 100+ (5.17e3) | 100+ (4.60e3) | 78 (3.54e3) |
| Sparse logreg | MNIST | 325 (444) | 212 (387) | 325 (516) | 203 (286) | 149 (218) |
| Sparse logreg | News20 | 316 (4.96e3) | 211 (3.84e3) | 316 (6.36e3) | 207 (3.73e3) | 137 (2.71e3) |
| SVM | MNIST | 1000+ (930) | 172 (287) | 73 (127) | 285 (340) | 41 (88.0) |
| SVM | News20 | 259 (2.63e3) | 262 (2.74e3) | 259 (3.83e3) | 267 (2.78e3) | 217 (2.37e3) |
| SDP | Ham-9-5-6 | 100+ (2.01e3) | 100+ (2.14e3) | 35 (860) | 100+ (2.14e3) | 30 (703) |

Robust to initial penalty selection
• More sensitivity analysis in the paper!

[Figure: ENRegression-Synthetic2 - iterations versus initial penalty parameter ($10^{-2}$ to $10^4$, log-log scale).]

Acceleration by distribution

[Figure: SVM-Synthetic2 - iterations and runtime in seconds versus number of cores ($10^1$ to $10^2$, log-log scale).]

Summary
• Fully automated optimizer for the consensus problem in distributed computing
• Node-specific spectral penalty for ADMM
• $O(1/k)$ convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Thank you! Poster #28 tonight
Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, Tom Goldstein

