Adaptive Consensus ADMM for Distributed Optimization

Zheng Xu¹, Gavin Taylor², Hao Li¹, Mário Figueiredo³, Xiaoming Yuan⁴, and Tom Goldstein¹

Abstract
Ø Study the alternating direction method of multipliers (ADMM) for distributed model-fitting problems.
Ø Boost ADMM performance by using different fine-tuned algorithm parameters on each worker node.
Ø Automatically tune the parameters without user oversight by assuming Barzilai-Borwein-style gradients.
Ø Present an O(1/k) convergence rate for adaptive ADMM methods with node-specific parameters.

ADMM and diagonal penalty
Ø Constrained problem in general form:
$$\min_{u,v}\; f(u) + g(v) \quad \text{subject to} \quad Au + Bv = b$$
Ø Saddle point problem by augmented Lagrangian:
$$\max_{\lambda}\,\min_{u,v}\; f(u) + g(v) + \langle \lambda,\, b - Au - Bv\rangle + \tfrac{1}{2}\|b - Au - Bv\|_T^2$$
Ø Alternating direction method of multipliers (ADMM):
$$u^{k+1} = \arg\min_u\; f(u) + \langle \lambda^k, -Au\rangle + \tfrac{1}{2}\|b - Au - Bv^k\|_{T^k}^2$$
$$v^{k+1} = \arg\min_v\; g(v) + \langle \lambda^k, -Bv\rangle + \tfrac{1}{2}\|b - Au^{k+1} - Bv\|_{T^k}^2$$
$$\lambda^{k+1} = \lambda^k + T^k\,(b - Au^{k+1} - Bv^{k+1})$$
where $T = \mathrm{diag}(\tau_1 I_d, \ldots, \tau_N I_d)$ is a diagonal penalty matrix and $\|x\|_T^2 = x^\top T x$.

Consensus problem
Ø Objective:
$$\min_{u_i,\,v}\; \sum_{i=1}^N f_i(u_i) + g(v) \quad \text{subject to} \quad u_i = v, \;\; \forall i = 1, \ldots, N$$
This is the general form with $u = (u_1; \ldots; u_N)$, $A = I_{dN}$, $B = -(I_d; \ldots; I_d)$, $b = 0$, and $f(u) = \sum_{i=1}^N f_i(u_i)$.
Ø Consensus ADMM (a minimal code sketch follows below):
$$u_i^{k+1} = \arg\min_{u_i}\; f_i(u_i) + \langle \lambda_i^k,\, v^k - u_i\rangle + \tfrac{\tau_i^k}{2}\|v^k - u_i\|^2$$
$$v^{k+1} = \arg\min_v\; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k,\, v - u_i^{k+1}\rangle + \tfrac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \Big)$$
$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k\,(v^{k+1} - u_i^{k+1})$$
The $u_i$-updates run on local worker nodes, which store $u_i$ and $f_i(u_i)$; the $v$-update runs on the central server, which stores $v$ and $g(v)$.
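To make the division of labor concrete, here is a minimal Python sketch of this loop with fixed node-specific penalties $\tau_i$ (the adaptive update is described later). The names consensus_admm, prox_fs, and prox_g are ours; each prox_fs[i] stands in for whatever local solver node i uses:

```python
import numpy as np

def consensus_admm(prox_fs, prox_g, d, iters=100, tau0=1.0):
    """Consensus ADMM with node-specific (here fixed) penalties tau_i.

    prox_fs[i](z, t) returns argmin_u f_i(u) + (t/2)*||u - z||^2  (local solver)
    prox_g(z, t)     returns argmin_v g(v)  + (t/2)*||v - z||^2   (server solver)
    """
    N = len(prox_fs)
    tau = np.full(N, tau0)                  # one penalty parameter per node
    u, lam = np.zeros((N, d)), np.zeros((N, d))
    v = np.zeros(d)
    for k in range(iters):
        # local u_i-updates (embarrassingly parallel across worker nodes):
        # argmin f_i(u) + <lam_i, v - u> + (tau_i/2)||v - u||^2
        #   = prox of f_i evaluated at v + lam_i/tau_i
        for i in range(N):
            u[i] = prox_fs[i](v + lam[i] / tau[i], tau[i])
        # central v-update: prox of g at the tau-weighted average
        z = (tau[:, None] * u - lam).sum(axis=0) / tau.sum()
        v = prox_g(z, tau.sum())
        # dual updates
        for i in range(N):
            lam[i] += tau[i] * (v - u[i])
    return u, v
```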
Ø Classification/regression problems:
• Example (EN regression): $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|^2$, $g(v) = \rho_1 |v| + \tfrac{\rho_2}{2}\|v\|^2$; for this choice the $v$-update has a closed form (see the sketch below).
• Others: sparse logistic regression, support vector machines (SVMs), semidefinite programming (SDP).
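For the elastic net regularizer above, the server step reduces to soft-thresholding. A minimal sketch, with the derivation in the docstring (the helper names are ours):

```python
import numpy as np

def soft_threshold(z, kappa):
    """Elementwise argmin_v kappa*|v|_1 + (1/2)*||v - z||^2."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def elastic_net_v_update(u, lam, tau, rho1, rho2):
    """Closed-form v-update for g(v) = rho1*|v|_1 + (rho2/2)*||v||^2.

    Setting the subgradient of the v-subproblem to zero gives
      (rho2 + sum_i tau_i) * v + rho1 * sign(v) = sum_i (tau_i*u_i - lam_i),
    which is solved by soft-thresholding.
    """
    t = rho2 + tau.sum()
    z = (tau[:, None] * u - lam).sum(axis=0)
    return soft_threshold(z / t, rho1 / t)
```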
Background: spectral stepsize
Ø Gradient descent: $x^{k+1} = x^k - \tau_k \nabla F(x^k)$
Ø Spectral (Barzilai-Borwein) stepsize: $\tau_k = 1/\alpha$, where $\alpha$ solves
$$\nabla F(x^k) - \nabla F(x^{k-1}) = \alpha\,(x^k - x^{k-1})$$
in the least-squares sense, i.e., assuming a locally linear gradient $\nabla F(x) = \alpha x + a$ (a code sketch follows below).
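In code, the least-squares slope estimate is a single inner-product ratio; a sketch of the classic BB1 variant (the function name is ours):

```python
import numpy as np

def bb_stepsize(x_prev, x_curr, g_prev, g_curr):
    """Spectral (Barzilai-Borwein) stepsize tau_k = 1/alpha, where alpha is
    the least-squares solution of g_curr - g_prev = alpha*(x_curr - x_prev)."""
    dx = x_curr - x_prev
    dg = g_curr - g_prev
    alpha = np.dot(dx, dg) / np.dot(dx, dx)   # slope of the secant model
    return 1.0 / alpha
```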
Adaptive Consensus ADMM
Ø Dual interpretation: ADMM is equivalent to Douglas-Rachford splitting applied to the dual problem
$$\min_{\lambda}\; \underbrace{f^*(A^\top \lambda) - \langle \lambda, b\rangle}_{\hat f(\lambda)} \;+\; \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$
Ø Linear assumption (the spectral-stepsize model applied to the dual): $\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + a$ and $\partial \hat g(\lambda) = M_\beta \lambda + b$, where $M_\alpha$, $M_\beta$ are diagonal matrices.
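For completeness, the dual form follows from the saddle-point formulation by a standard conjugate-function computation:
$$\begin{aligned}
\max_{\lambda}\,\min_{u,v}\; & f(u) + g(v) + \langle \lambda,\, b - Au - Bv\rangle \\
&= \max_{\lambda}\; \langle \lambda, b\rangle - \max_u \big( \langle A^\top\lambda, u\rangle - f(u) \big) - \max_v \big( \langle B^\top\lambda, v\rangle - g(v) \big) \\
&= \max_{\lambda}\; \langle \lambda, b\rangle - f^*(A^\top\lambda) - g^*(B^\top\lambda),
\end{aligned}$$
and negating the objective turns this into the minimization above.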
Ø Adaptive rule: split the equations relating $T$, $M_\alpha$, and $M_\beta$ into blocks, and apply the spectral penalty proposition (Xu et al., AISTATS 2017) to each block:
$$\hat\tau_i^k = 1/\sqrt{\hat\alpha_i^k \hat\beta_i^k}$$
Ø Curvature estimation and safeguarding the linear assumption: the node-wise curvatures $\hat\alpha_i^k$, $\hat\beta_i^k$ are estimated in Barzilai-Borwein fashion from the change in the iterates between iteration $k$ and an earlier iteration $k_0$, and the spectral estimate is accepted only when the correlation criteria $\alpha_{\mathrm{cor},i}^k$, $\beta_{\mathrm{cor},i}^k$ indicate that the linear model fits well.
Ø Safeguarding convergence (bounded adaptivity; see the sketch after this list):
$$\tau_i^{k+1} = \max\left\{ \min\left\{ \hat\tau_i^{k+1},\; \Big(1 + \frac{C_{\mathrm{cg}}}{k^2}\Big)\tau_i^k \right\},\; \frac{\tau_i^k}{1 + C_{\mathrm{cg}}/k^2} \right\}$$
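A compact sketch of the safeguarded node-wise update; the acceptance test follows the description above, and the default constants are illustrative rather than the paper's tuned values:

```python
def update_penalty(tau, tau_spectral, corr_alpha, corr_beta, k,
                   eps_cor=0.2, C_cg=1e10):
    """Safeguarded spectral penalty update for one node at iteration k.

    tau:          current penalty tau_i^k
    tau_spectral: spectral estimate 1/sqrt(alpha_hat_i * beta_hat_i)
    corr_alpha, corr_beta: correlation criteria for the linear assumption
    """
    # keep the current penalty if the linear model fits poorly
    tau_hat = tau_spectral if (corr_alpha > eps_cor and corr_beta > eps_cor) else tau
    # bounded adaptivity: tau_i may change by at most a factor 1 + C_cg/k^2
    bound = 1.0 + C_cg / k**2
    return max(min(tau_hat, bound * tau), tau / bound)
```

Since each step changes $\tau_i$ by at most a factor $1 + C_{\mathrm{cg}}/k^2$, we get $\eta^k \le C_{\mathrm{cg}}/k^2$, so the bounded-adaptivity condition below holds automatically.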
O(1/k) convergence with adaptivity
Ø Bounded adaptivity:
$$\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where} \quad \eta^k = \max_{i \in \{1,\ldots,N\}} \max\left\{ \frac{\tau_i^k}{\tau_i^{k-1}} - 1,\; \frac{\tau_i^{k-1}}{\tau_i^k} - 1 \right\}$$
Ø The norm of the residuals converges to zero.
Ø The worst-case ergodic O(1/k) convergence rate holds in the variational inequality sense.

Experiments
Ø Applications: EN regression (MNIST), sparse logistic regression (CIFAR10), SVM (MNIST), SDP (Ham-9-5-6).

Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+.

Dataset                  CADMM                 RB-ADMM             AADMM                CRB-ADMM             ACADMM (proposed)
                         (Boyd et al., 2011)   (He et al., 2000)   (Xu et al., 2017a)   (Song et al., 2016)
EN regression MNIST      100+ (1.49e4)         88 (1.29e3)         87 (1.27e4)          40 (5.99e3)          14 (2.18e3)
Sparse logreg CIFAR10    310 (700)             152 (402)           149 (368)            310 (727)            44 (118)
SVM MNIST                1000+ (930)           172 (287)           285 (340)            73 (127)             41 (88.0)
SDP Ham-9-5-6            100+ (2.01e3)         100+ (2.14e3)       100+ (2.14e3)        35 (860)             30 (703)
[Figures: relative residual vs. iterations and vs. runtime (seconds) on ENRegression-Synthetic2 and SVM-Synthetic2 for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM, and iteration counts as the initial penalty parameter, the number of cores, and the number of samples vary.]
Ø Fast convergence on synthetic and benchmark datasets; more results in the paper.
Ø Robust to the initial penalty parameter, the number of cores, and the amount of data.
Ø Acceleration by distributed computing