Adaptive Consensus ADMM for Distributed Optimization
Zheng Xu, with Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein

Outline
• Consensus problem in distributed computing
• Alternating direction method of multipliers (ADMM) and the penalty parameter
• Adaptive consensus ADMM (ACADMM) with spectral stepsize: a fully automated optimizer
• The O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Statistical learning problem

$\min_v \; f(v) + g(v)$

• Example (elastic net regression):

$\min_x \; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$

[Figure: heatmaps of the observation vector $c$, data matrix $D$, and variable $x$, visualizing $c = Dx$]
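To make the example concrete, here is a minimal numpy rendering of this objective; the function and argument names are illustrative, not from the talk.

```python
# Elastic net objective: 0.5*||Dx - c||_2^2 + rho1*||x||_1 + (rho2/2)*||x||_2^2
import numpy as np

def elastic_net_objective(x, D, c, rho1, rho2):
    r = D @ x - c
    return 0.5 * (r @ r) + rho1 * np.abs(x).sum() + 0.5 * rho2 * (x @ x)
```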

Problem decomposition and data parallelism

$\min_v \; \sum_{i=1}^N f_i(v) + g(v)$

• Example:

$\min_x \; \sum_{i=1}^N \tfrac{1}{2}\|D_i x - c_i\|_2^2 + \rho_1\|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$

where the data are partitioned by rows across $N$ nodes: $c = [c_1; \ldots; c_i; \ldots; c_N]$ and $D = [D_1; \ldots; D_i; \ldots; D_N]$.

[Figure: the heatmaps of $c = Dx$ split into $N$ row blocks, one per node]
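Row-partitioning the data is all that data parallelism requires here. A one-line numpy sketch (illustrative names, assuming in-memory arrays):

```python
# Split D and c into N row blocks: D = [D_1; ...; D_N], c = [c_1; ...; c_N]
import numpy as np

def partition_rows(D, c, N):
    return np.array_split(D, N, axis=0), np.array_split(c, N)
```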

Consensus problem

$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$

• Example: each local node $i$ holds $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|_2^2$; the central server holds $v$ and $g(v) = \rho_1\|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2$.

[Figure: $N$ local nodes holding $f_1(u_1), \ldots, f_i(u_i), \ldots, f_N(u_N)$, each with a data block $(D_i, c_i)$, connected to a central server holding $v$ and $g(v)$]

Consensus ADMM

$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$

$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) + \langle \lambda_i^k, v^k - u_i \rangle + \tfrac{\tau_i^k}{2}\|v^k - u_i\|^2$

$v^{k+1} = \arg\min_v \; g(v) + \sum_{i=1}^N \left( \langle \lambda_i^k, v - u_i^{k+1} \rangle + \tfrac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \right)$

$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^{k+1} - u_i^{k+1})$

[Figure: 3D surface illustration]
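The three updates above have closed forms for the elastic net example. Below is a minimal serial simulation of the distributed algorithm, a sketch assuming $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|^2$ and the elastic net $g$, with a fixed per-node penalty; all names are illustrative.

```python
# Consensus ADMM sketch for the elastic net example (serial simulation of
# the distributed updates; the loop over i stands in for the local nodes).
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def consensus_admm(D_blocks, c_blocks, rho1, rho2, tau=1.0, iters=100):
    N, d = len(D_blocks), D_blocks[0].shape[1]
    taus = np.full(N, float(tau))            # per-node penalties tau_i
    v = np.zeros(d)
    u = [np.zeros(d) for _ in range(N)]
    lam = [np.zeros(d) for _ in range(N)]
    for _ in range(iters):
        # local u_i-updates: (D_i^T D_i + tau_i I) u_i = D_i^T c_i + lam_i + tau_i v
        for i, (Di, ci) in enumerate(zip(D_blocks, c_blocks)):
            u[i] = np.linalg.solve(Di.T @ Di + taus[i] * np.eye(d),
                                   Di.T @ ci + lam[i] + taus[i] * v)
        # central v-update: soft-thresholding (closed form for elastic net g)
        s = sum(taus[i] * u[i] - lam[i] for i in range(N))
        v = soft_threshold(s / (rho2 + taus.sum()),
                           rho1 / (rho2 + taus.sum()))
        # dual updates
        for i in range(N):
            lam[i] = lam[i] + taus[i] * (v - u[i])
    return v
```

ACADMM replaces the fixed `taus` with the per-node spectral estimates described later in the talk.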

Consensus ADMM and penalty parameter

$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$

$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) + \langle \lambda_i^k, v^k - u_i \rangle + \tfrac{\tau_i^k}{2}\|v^k - u_i\|^2$

$v^{k+1} = \arg\min_v \; g(v) + \sum_{i=1}^N \left( \langle \lambda_i^k, v - u_i^{k+1} \rangle + \tfrac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \right)$

$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^{k+1} - u_i^{k+1})$

• The penalty $\tau_i^k$ is the only free parameter!

Background: gradient descent

• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$

[Figure: a gradient step of size $\tau^k$ from $x^k$]

Background: quadratic case

• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• If $F$ is quadratic, $F(x) = \tfrac{\alpha}{2}\|x - x^*\|^2$
• Then the optimal stepsize is $\tau^k = 1/\alpha$: a single step lands exactly on the minimizer $x^*$

[Figure: $x^k \to x^{k+1}$ with $\tau^k = 1/\alpha$]

Background: spectral stepsize

• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• Spectral (Barzilai-Borwein) stepsize:
  • Assume the function is locally quadratic with curvature $\alpha$
  • Estimate the curvature by solving the 1-d least squares problem $\nabla F(x) = \alpha x + a$
  • Run gradient descent with $\tau^k = 1/\alpha$

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
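A minimal sketch of the Barzilai-Borwein idea just described (a generic illustration, not code from the talk): fit the curvature $\alpha$ by least squares on successive differences of iterates and gradients, then step with $\tau = 1/\alpha$.

```python
# Gradient descent with the Barzilai-Borwein (spectral) stepsize: the
# curvature alpha is fit by 1-d least squares to dg ≈ alpha * dx.
import numpy as np

def bb_gradient_descent(grad, x0, iters=100, tau0=1e-3):
    x_prev, g_prev = x0, grad(x0)
    x = x_prev - tau0 * g_prev                    # bootstrap first step
    for _ in range(iters):
        g = grad(x)
        dx, dg = x - x_prev, g - g_prev
        alpha = (dx @ dg) / (dx @ dx)             # least-squares curvature
        tau = 1.0 / alpha if alpha > 0 else tau0  # safeguard for alpha <= 0
        x_prev, g_prev = x, g
        x = x - tau * g
    return x

# e.g. for F(x) = 0.5*||Ax - b||^2:  grad = lambda x: A.T @ (A @ x - b)
```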

Advantages of spectral stepsize

• Automates the stepsize selection
• Achieves fast (superlinear) convergence
• What about ADMM?

[Figure: $x^k \to x^{k+1}$ with $\tau^k = 1/\alpha$]

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
Y. Dai. A New Analysis on the Barzilai-Borwein Gradient Method. 2013.

Spectral penalty of ADMM

• Consensus problem:

$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$

• Assume the function is locally quadratic
• Estimate the curvature(s)
• Decide the penalty parameter

Dual interpretation

• Consensus problem:

$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$

• Equivalently:

$\min_{u, v} \; f(u) + g(v), \quad \text{subject to } u + Bv = 0,$

where $B = -(I_d; \ldots; I_d)$ and $u = (u_1; \ldots; u_N)$.

• Dual problem by Fenchel conjugate:

$\min_\lambda \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat{f}(\lambda)} + \underbrace{g^*(B^T \lambda)}_{\hat{g}(\lambda)}$

• No constraints!
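For completeness, a short sketch of where this dual comes from: the standard Lagrangian/Fenchel argument, written for a general constraint $Au + Bv = b$ (here $A = I$ and $b = 0$).

```latex
\begin{aligned}
\max_{\lambda}\ \min_{u,v}\ & f(u) + g(v) + \langle \lambda,\, b - Au - Bv \rangle \\
  &= \max_{\lambda}\ \langle \lambda, b \rangle
     - \underbrace{\max_u \big( \langle A^T\lambda, u \rangle - f(u) \big)}_{f^*(A^T\lambda)}
     - \underbrace{\max_v \big( \langle B^T\lambda, v \rangle - g(v) \big)}_{g^*(B^T\lambda)}
\end{aligned}
```

Negating turns the maximization into the unconstrained minimization $\min_\lambda f^*(\lambda) - \langle \lambda, b \rangle + g^*(B^T\lambda)$ shown above (with $A = I$).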

Dual problem and DRS

$\min_\lambda \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat{f}(\lambda)} + \underbrace{g^*(B^T \lambda)}_{\hat{g}(\lambda)}$

• ADMM on $(u, v, \lambda)$ is equivalent to Douglas-Rachford splitting on $(\hat\lambda, \lambda)$ applied to $\hat{f}(\hat\lambda) + \hat{g}(\lambda)$, where

$\hat\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^k - u_i^{k+1})$

Linear approximation

• The gradients are linear:

$\partial \hat{f}(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat{g}(\lambda) = \beta \cdot \lambda + b$

[Figure: heatmap illustration of the linear gradient model]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation

• The gradients are linear, with one block per node:

$\partial \hat{f}(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat{g}(\lambda) = \beta \cdot \lambda + b, \qquad \hat\lambda = [\hat\lambda_i]_{i \le N}, \quad \lambda = [\lambda_i]_{i \le N}$

[Figure: the same heatmap illustration, partitioned into $N$ per-node blocks]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.

Linear approximation

• The gradients are linear
• Node-specific penalty parameter: each node's block gets its own curvature,

$[\partial \hat{f}(\hat\lambda)]_i = \alpha_i \cdot \hat\lambda_i + a_i, \qquad [\partial \hat{g}(\lambda)]_i = \beta_i \cdot \lambda_i + b_i, \qquad i = 1, \ldots, N$

[Figure: per-node heatmap blocks with curvatures $\alpha_1, \ldots, \alpha_N$ and $\beta_1, \ldots, \beta_N$]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. 2016.

Linear approximation

• The gradients are linear:

$\partial \hat{f}(\hat\lambda) = M_\alpha \hat\lambda + a \quad \text{and} \quad \partial \hat{g}(\lambda) = M_\beta \lambda + b,$

where $M_\alpha$, $M_\beta$ are diagonal matrices.

• Node-specific penalty parameter, via the stepsize matrices $T_\alpha = M_\alpha^{-1}$ and $T_\beta = M_\beta^{-1}$

Node-specific spectral penalty

• Schema:

$\tau_i^k = 1 / \sqrt{\alpha_i \beta_i}, \qquad \forall i = 1, \ldots, N$

• Estimation and safeguarding: the curvatures $\alpha_i^k$, $\beta_i^k$ and the correlations $\alpha_{\mathrm{cor},i}^k$, $\beta_{\mathrm{cor},i}^k$ used to safeguard them are computed from the ADMM variables $(u, v, \lambda)$

[Figure: estimation from the differences between iterations $k$ and $k_0$: $u_i^k - u_i^{k_0}$ and $\hat\lambda_i^k - \hat\lambda_i^{k_0}$ for $\alpha_i^k$; $v^k - v^{k_0}$ and $\lambda_i^k - \lambda_i^{k_0}$ for $\beta_i^k$]
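A hedged sketch of one node's update under this schema: curvatures are fit by 1-d least squares on differences between iterations $k$ and $k_0$ (e.g. `d_lam_hat` $= \hat\lambda_i^k - \hat\lambda_i^{k_0}$), and a correlation test decides whether the local-quadratic model is trustworthy. The exact hybrid estimator and safeguard thresholds are specified in the paper; the function names, the `eps_cor` constant, and the choice of gradient surrogates below are illustrative.

```python
# Node-specific spectral penalty tau_i = 1/sqrt(alpha_i * beta_i), with
# least-squares curvature fits and a correlation-based safeguard.
import numpy as np

def fit_curvature(d_lam, d_grad):
    """Fit d_grad ≈ alpha * d_lam; also return the correlation used to
    judge whether the local-quadratic model is reliable."""
    alpha = (d_lam @ d_grad) / (d_lam @ d_lam)
    corr = (d_lam @ d_grad) / (np.linalg.norm(d_lam) * np.linalg.norm(d_grad))
    return alpha, corr

def spectral_penalty(tau_old, d_lam_hat, d_grad_f, d_lam, d_grad_g,
                     eps_cor=0.2):
    # d_grad_f: observed change in the (sub)gradient of f_hat (from the
    # u_i iterates); d_grad_g: likewise for g_hat (from the v iterates).
    alpha, a_cor = fit_curvature(d_lam_hat, d_grad_f)  # curvature of f_hat
    beta, b_cor = fit_curvature(d_lam, d_grad_g)       # curvature of g_hat
    if a_cor > eps_cor and b_cor > eps_cor:            # both fits reliable
        return 1.0 / np.sqrt(alpha * beta)
    return tau_old                                     # safeguard: keep tau
```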

O(1/k) convergence with adaptivity

• Bounded adaptivity:

$\sum_{k=1}^\infty (\eta^k)^2 < \infty, \quad \text{where} \quad (\eta^k)^2 = \max_{i \in \{1, \ldots, p\}} (\eta_i^k)^2, \qquad (\eta_i^k)^2 = \max\{\tau_i^k / \tau_i^{k-1} - 1, \; \tau_i^{k-1} / \tau_i^k - 1\}.$

• The norm of the residuals converges to zero
• The worst-case ergodic $O(1/k)$ convergence rate holds in the variational inequality sense
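Operationally, the condition says each penalty may only change by a relative factor whose squared excess is summable over $k$. A tiny sketch of the measure itself (illustrative names):

```python
# (eta^k)^2 = max_i max{ tau_i^k/tau_i^{k-1} - 1, tau_i^{k-1}/tau_i^k - 1 }
import numpy as np

def eta_squared(tau_prev, tau_new):
    r = np.asarray(tau_new) / np.asarray(tau_prev)
    return float(np.max(np.maximum(r - 1.0, 1.0 / r - 1.0)))
```

Any rule that caps the relative change of each $\tau_i$ by a summable sequence, or that stops adapting after finitely many iterations, makes the sum trivially finite.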

Experiments

• ADMM methods:
  • Consensus ADMM (CADMM) [Boyd et al. 2011]
  • Residual balancing (RB-ADMM) [He et al. 2000]
  • Consensus residual balancing (CRB-ADMM) [Song et al. 2016]
  • Adaptive ADMM (AADMM) [Xu et al. 2017]
• Applications:
  • Linear regression with elastic net regularizer
  • Sparse logistic regression
  • Support vector machine
  • Semidefinite programming

Residual plot

• Application: sparse logistic regression
• Dataset: News20, size 19,996 × 1,355,191
• Distributed on 128 cores

[Figure: left panel, relative residual (log scale, $10^{-5}$ to $10^1$) vs. iterations (0-350) for CADMM, AADMM, and ACADMM; right panel, the penalty $\tau$ chosen by ACADMM vs. iterations]

More numerical results

• More results in the paper! Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+.

Application     Dataset     CADMM          RB-ADMM        AADMM          CRB-ADMM       ACADMM
EN regression   MNIST       100+ (1.49e4)  88 (1.29e3)    40 (5.99e3)    87 (1.27e4)    14 (2.18e3)
EN regression   News20      100+ (4.61e3)  100+ (4.60e3)  100+ (5.17e3)  100+ (4.60e3)  78 (3.54e3)
Sparse logreg   MNIST       325 (444)      212 (387)      325 (516)      203 (286)      149 (218)
Sparse logreg   News20      316 (4.96e3)   211 (3.84e3)   316 (6.36e3)   207 (3.73e3)   137 (2.71e3)
SVM             MNIST       1000+ (930)    172 (287)      73 (127)       285 (340)      41 (88.0)
SVM             News20      259 (2.63e3)   262 (2.74e3)   259 (3.83e3)   267 (2.78e3)   217 (2.37e3)
SDP             Ham-9-5-6   100+ (2.01e3)  100+ (2.14e3)  35 (860)       100+ (2.14e3)  30 (703)

Robust to initial penalty selection

• More sensitivity analysis in the paper!

[Figure: ENRegression-Synthetic2, iterations to convergence vs. initial penalty parameter ($10^{-2}$ to $10^4$) for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM]

Acceleration by distribution

[Figure: SVM-Synthetic2, two panels vs. number of cores ($10^1$ to $10^2$): iterations (left) and runtime in seconds (right) for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM]

Summary
• Fully automated optimizer for the consensus problem in distributed computing
• Node-specific spectral penalty for ADMM
• O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets

Thank you! Poster #28 tonight.
Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, Tom Goldstein

