Non-convex Optimization with Frank-Wolfe Algorithm and Its Variants Jean Lafond, Hoi-To Wai and Eric Moulines∗

Abstract. Recently, the Frank-Wolfe (a.k.a. conditional gradient) algorithm has become a popular tool for tackling machine learning problems, as it avoids the costly projection computation in traditional first-order optimization methods. While the Frank-Wolfe (FW) algorithm has been extensively studied for convex optimization, little is known about the FW algorithm in non-convex optimization. This paper presents a unified convergence analysis for the FW algorithm and its variants under the setting of a non-convex but smooth objective with a convex, compact constraint set. Our results are based on a novel observation on the so-called Frank-Wolfe gap (FW gap), which measures the closeness of a solution to a stationary point. With a diminishing step size, we show that the FW gap decays at a rate of O(1/√t); and the same rate holds for variants of FW such as the online FW algorithm and the decentralized FW algorithm. Numerical experiments are presented to support our findings.

1 Introduction

Let f : R^d → R be a continuously differentiable (possibly non-convex) function and C ⊆ R^d be a closed and bounded convex set. We consider the following optimization problem:

min_θ f(θ)  s.t.  θ ∈ C .   (1)

This paper studies the Frank-Wolfe (FW) algorithm, which has become popular recently due to its projection-free feature. Compared to traditional projected gradient algorithms (PGAs), the FW algorithm involves solving a linear optimization (LO) problem that can often be performed much more efficiently than the projection step required by PGA; see [1]. Previous research has focused on convex optimization with FW/FW-based algorithms. For example, [2, 3] studied conditions under which the FW algorithm converges at a linear rate; [4] studied an online FW algorithm with a regret bound of O(1/√T), where T is the number of rounds played; [5] combined the FW algorithm with the popular stochastic variance reduced gradient (SVRG) method to efficiently handle finite-sum optimization problems. On the other hand, little is known about non-convex optimization with FW/FW-based algorithms. Recent results can be found in several unpublished works: e.g., [6] considered an adaptive step size rule in the FW algorithm to yield a convergence rate of O(1/√t), which is similar to ours; [7] studied a fixed step size rule but achieved a slightly worse convergence rate than ours; [8] applied the SVRG technique to the FW algorithm with non-convex objectives.

Contributions. This paper presents a unified analysis of the convergence of FW algorithm(s) for non-convex optimization. Under the setting of a smooth objective function and a bounded convex constraint set, we show that the limit points of the iterates generated by FW algorithms are stationary points of (1), and that they can be found at a rate of O(1/√T). We also provide additional conditions under which the convergence rate can be accelerated. Lastly, we demonstrate an interesting application to sparse+low rank matrix completion and provide numerical experiments to support our findings.

Notations. For d ∈ N, we denote the set {1, ..., d} as [d]. The ith element of a vector θ is [θ]_i. The Euclidean norm is denoted by ‖·‖. A function f is L-smooth if f(θ) − f(θ′) ≤ ⟨∇f(θ′), θ − θ′⟩ + L‖θ − θ′‖²/2 for all θ, θ′ ∈ R^d.

∗ J. Lafond is with Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, France. H.-T. Wai is with School of ECEE, Arizona State University, USA. The first two authors have contributed equally. E. Moulines is with CMAP, Ecole Polytechnique, France. Emails: [email protected], [email protected], [email protected]

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Preliminaries. Consider the classical FW algorithm applied to (1). Let θ_t ∈ C be a feasible solution to (1) and γ_t ∈ (0, 1] be a step size; at iteration t ∈ N, we perform:

θ_{t+1} = θ_t + γ_t (a_t − θ_t), where a_t := arg min_{a∈C} ⟨∇f(θ_t), a⟩ .   (2)

The latter optimization in (2) is known as the linear optimization (LO) step required by FW algorithm(s). This can be viewed as the projection-free counterpart to the Euclidean projection step required by traditional PGA, i.e., min_{θ∈C} ‖θ′ − θ‖. In many interesting cases, the LO step admits a more efficient solution than the projection. For convex problems, a well-known fact is that the objective value of the FW algorithm converges to the minimum at a rate of at least O(1/t) [1], i.e., f(θ_t) − f(θ^⋆) = O(1/t), where θ^⋆ is an optimal solution to (1). This rate can be accelerated under some conditions, e.g., see [2, 3]. For non-convex problems, it is not possible to take the difference in objective values as a benchmark. Instead, we focus on the following FW/duality gap:

g_t := max_{θ∈C} ⟨∇f(θ_t), θ_t − θ⟩ = ⟨∇f(θ_t), θ_t − a_t⟩ .   (3)

Importantly, if g_t = 0, then ⟨∇f(θ_t), θ_t − θ⟩ ≤ 0 for all θ ∈ C, i.e., θ_t is a stationary point of Problem (1). Like [6, 7, 8], we can take g_t as a measure of the stationarity of the iterate θ_t. As mentioned, the convergence of the FW algorithm for non-convex objective functions has only been considered recently by a few authors [6, 7, 8], and they studied the convergence rate of the FW algorithm in terms of g_t. To our knowledge, the best convergence rate available to date is g_t = O(1/√t). In the following analysis, we provide additional guarantees for the FW algorithm and show that a similar convergence guarantee also holds for a variety of FW-based algorithms.
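For concreteness, the following is a minimal sketch of the classical FW iteration (2) together with the FW gap (3). This is illustrative code of our own (not from the references): `grad` and `lo_oracle` are placeholder oracles for ∇f and the LO step, and the toy usage example solves a smooth problem over an ℓ1-ball, whose LO step is a signed, scaled coordinate vector (a standard fact).

```python
import numpy as np

def frank_wolfe(grad, lo_oracle, theta0, num_iter=1000, alpha=0.5):
    """Sketch of the classical FW iteration (2) with diminishing step size gamma_t = t^(-alpha).

    grad(theta)  -> gradient of f at theta (assumed exact here)
    lo_oracle(g) -> argmin_{a in C} <g, a>, the LO step over the constraint set C
    Returns the last iterate and the history of FW gaps g_t from (3).
    """
    theta = np.asarray(theta0, dtype=float)
    gaps = []
    for t in range(1, num_iter + 1):
        g = grad(theta)
        a_t = lo_oracle(g)                          # LO step: projection-free counterpart of PGA
        gaps.append(float(np.dot(g, theta - a_t)))  # FW gap g_t in (3); zero iff theta is stationary
        gamma = t ** (-alpha)                       # step size from Theorem 1, gamma_t in (0, 1]
        theta = theta + gamma * (a_t - theta)       # convex combination keeps theta feasible
    return theta, gaps

if __name__ == "__main__":
    # Toy usage: f(theta) = 0.5 * ||theta - y||^2 over the l1-ball {theta : ||theta||_1 <= R}.
    y, R = np.array([2.0, -1.0, 0.5]), 1.0
    grad = lambda th: th - y
    def lo_l1_ball(g):
        a = np.zeros_like(g)
        i = int(np.argmax(np.abs(g)))
        a[i] = -R * np.sign(g[i])                   # vertex of the l1-ball minimizing <g, a>
        return a
    theta, gaps = frank_wolfe(grad, lo_l1_ball, theta0=np.zeros(3), num_iter=200)
    print(theta, gaps[-1])
```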

2 Main Results

Consider a generalization of the FW algorithm (2) in which the objective function is time varying, i.e., at iteration t, the objective function is denoted by F_t(θ). Furthermore, the solution a_t to the LO may be erroneous; this may be due to inaccuracies in the LO solver, or simply because only an inexact gradient is available. We perform the tth FW update as

θ_{t+1} = θ_t + γ_t (â_t − θ_t), where â_t ≈ a_t := arg min_{a∈C} ⟨∇̂F_t(θ_t), a⟩, â_t ∈ C ,   (4)

where ∇̂F_t(θ_t) is a noisy gradient available to us. We consider the following assumptions:

H1. The time varying objective function satisfies |F_t(θ) − F_{t−1}(θ)| ≤ C_b · t^{−β} for all θ ∈ C, t ≥ 1 and some β > 0.

H2. The inexact LO solution satisfies ⟨∇F_t(θ_t), â_t − a_t⟩ ≤ C_g · t^{−η} for all t ≥ 1 and some η > 0.

Obviously, the FW algorithm in (2) is a special case of (4) satisfying H1 and H2 with C_g = C_b = 0, η = β = 1. The main result of this paper is summarized as follows (the non-asymptotic constants can be found in the appendix):

Theorem 1. Choose the step size as γ_t = t^{−α} for some α ∈ [0.5, 1). Assume H1, H2 and that each F_t is L-smooth and bounded by B over C. Then the following hold for the general FW algorithm (4):

(i) for any T ≥ 6,

min_{t∈[T/2+1,T]} max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩ = O(1/T^{min{1−α, β−α, η}}) ;   (5)

(ii) in addition, for any T ≥ 2, exactly one of the following holds:

(a) max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩ ≤ C_g · t^{−η} + (Lρ̄²/2) · t^{−α} = O(1/t^{min{η,α}}), for some t ∈ [T/2+1, T] ,   (6)

(b) F_t(θ_{t+1}) < F_t(θ_t), ∀ t ∈ [T/2+1, T] ,

i.e., either the FW gap bound can be improved to O(1/T^{min{η,α}}), or the objective value is monotonically decreasing over the epoch considered;

(iii) finally, suppose that C_b = 0, i.e., F_t(θ) = F(θ) for all t ≥ 1, that η + α > 1 and α > 0.5, and further that F(θ̄) takes a finite number of values over the stationary points θ̄. Then the sequence {θ_t}_{t≥1} has limit points, and each limit point θ̄ satisfies

max_{θ∈C} ⟨∇F(θ̄), θ̄ − θ⟩ = 0 .   (7)

We relegate the proof of Theorem 1 to Section 3 and Appendix A. Notice that if β ≥ 1, η ≥ α and we set α = 0.5, then a convergence rate of O(1/√T) is achieved. This matches the rate for PGA on non-convex problems in [9]. Moreover, in our numerical experiments, we observe that the FW gap often decays as O(1/t^α) for α > 0.5; this can be accounted for using (6). Below we list a few examples that satisfy H1, H2 and thus can be analyzed using our results.

2.1 Online FW (O-FW) algorithm

Like [4], we consider a fully informational setting for the online FW (O-FW) algorithm. At round/iteration t, an online learner plays θ_t and receives full information about the instantaneous loss function f_t(θ). For example, f_t(θ) may correspond to the data observed at the current round. To account for the loss functions from the past, we design the time varying objective function as the temporal average F_t(θ) := t^{−1} ∑_{s=1}^{t} f_s(θ). Now, as F_t(θ) is fully known, its gradient ∇F_t(θ) can be exactly evaluated at round t. As such, the FW algorithm in (4) can be directly applied, with H2 automatically satisfied with C_g = 0. If f_t(θ) is bounded by B for all θ ∈ C, then

|F_t(θ) − F_{t−1}(θ)| = | (1/t) f_t(θ) + (1/t − 1/(t−1)) ∑_{s=1}^{t−1} f_s(θ) | = (1/t) | f_t(θ) − F_{t−1}(θ) | ≤ 2B · t^{−1} , ∀ t ≥ 1 ,   (8)

i.e., H1 is satisfied with β = 1, C_b = 2B. Consequently, the results from Theorem 1 apply directly.
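As an illustration, the following is a minimal sketch of the O-FW update just described; it is our own illustrative code, with `loss_grads` a stream of gradient oracles for the revealed losses f_t and `lo_oracle` the LO step. In the fully informational setting the averaged gradient ∇F_t(θ_t) is evaluated exactly, so H2 holds with C_g = 0.

```python
import numpy as np

def online_fw(loss_grads, lo_oracle, theta0, alpha=0.5):
    """Sketch of O-FW: at round t, play theta_t, receive f_t, then take one FW step on the
    temporal average F_t(theta) = (1/t) * sum_{s<=t} f_s(theta)."""
    theta = np.asarray(theta0, dtype=float)
    seen = []                                       # gradient oracles of the losses revealed so far
    iterates = [theta.copy()]
    for t, g_t in enumerate(loss_grads, start=1):
        seen.append(g_t)
        grad_Ft = sum(g(theta) for g in seen) / t   # exact gradient of F_t at theta_t (C_g = 0)
        a_t = lo_oracle(grad_Ft)                    # LO step
        gamma = t ** (-alpha)
        theta = theta + gamma * (a_t - theta)
        iterates.append(theta.copy())
    return iterates
```

For structured losses (e.g., quadratics built from streaming data), ∇F_t(θ_t) can instead be maintained through running sufficient statistics, avoiding the O(t) re-evaluation of past gradients above.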

2.2 Decentralized FW (DeFW) algorithm

In the decentralized FW (DeFW) algorithm [10], we consider solving an optimization problem of the form:

min_θ (1/N) ∑_{i=1}^{N} f_i(θ)  s.t.  θ ∈ C .   (9)

The problem is to be solved distributively by a network of N agents, each of them holding a private objective function f_i(θ). Our goal is for the agents to cooperatively find a stationary point of the above problem through exchanging information over the network. Following standard set-ups in distributed optimization [11], we assume that the network is described by an undirected graph G = (V, E), where V = [N] and E ⊆ V × V. The graph is associated with a doubly stochastic weight matrix W ∈ R_+^{N×N} such that [W]_{ij} = 0 if and only if (i, j) ∉ E. To describe the DeFW algorithm, we denote by θ_t^i the local copy of θ_t kept by the ith agent at iteration t. We perform the following updates in order, for each i ∈ [N]:

θ̄_t^i = ∑_{j=1}^{N} W_{ij} · θ_t^j ,   ∇_t^i F = ∑_{j=1}^{N} W_{ij} · ( ∇_{t−1}^j F − ∇f_j(θ̄_{t−1}^j) + ∇f_j(θ̄_t^j) ) ,   (10a)

θ_{t+1}^i = θ̄_t^i + γ_t (a_t^i − θ̄_t^i), where a_t^i = arg min_{a∈C} ⟨∇_t^i F, a⟩ .   (10b)

Notice that (10a) represents the gossip-based average consensus (GAC) updates [12] (executed for one round) for averaging the parameter variables and the gradient vectors, while (10b) is the standard FW update.

We analyze the convergence of the DeFW algorithm (10) by studying the average iterate θ̄_t := N^{−1} ∑_{j=1}^{N} θ_t^j. Using (10), we see that θ̄_{t+1} = θ̄_t + γ_t ( N^{−1} ∑_{j=1}^{N} a_t^j − θ̄_t ), so the DeFW algorithm can be analyzed under the framework of (4). Now, H1 is satisfied with C_b = 0 since the objective function is not time varying. Secondly, if ‖θ‖ ≤ ρ for all θ ∈ C, the following inequality holds for all i ∈ [N]:

⟨∇F(θ̄_t), a_t^i⟩ ≤ ⟨∇_t^i F, a_t^i⟩ + ρ ‖∇_t^i F − ∇F(θ̄_t)‖ ≤ ⟨∇_t^i F, a_t⟩ + ρ ‖∇_t^i F − ∇F(θ̄_t)‖ ≤ ⟨∇F(θ̄_t), a_t⟩ + 2ρ ‖∇_t^i F − ∇F(θ̄_t)‖ ,   (11)

where the first and last inequalities are due to Cauchy-Schwarz, and the second inequality is due to the optimality of a_t^i. If each f_i is L-smooth, it can be proven that ‖∇_t^i F − ∇F(θ̄_t)‖ ≤ γ_t C_g′ for some C_g′ < ∞; see [10]. Therefore, when we set γ_t = t^{−α}, H2 is satisfied with η = α, C_g = 2ρC_g′ and thus Theorem 1 applies. Lastly, we remark that ‖θ̄_t^i − θ̄_t‖ also decays at the order of O(γ_t) [10], therefore the results in Theorem 1 apply to each local variable.
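The following is a minimal sketch of the DeFW iteration (10); it is our own illustrative code, assuming per-agent gradient oracles `grads[i]`, a common LO oracle, and a doubly stochastic weight matrix W, with a single GAC round per iteration and a simplified initialization.

```python
import numpy as np

def defw(grads, lo_oracle, W, theta0, num_iter=500, alpha=0.5):
    """Sketch of the DeFW updates (10a)-(10b) for N agents.

    grads     : list of N callables, grads[i](theta) = grad f_i(theta)
    lo_oracle : g -> argmin_{a in C} <g, a>
    W         : (N, N) doubly stochastic weight matrix of the graph G
    """
    N = len(grads)
    theta = np.tile(np.asarray(theta0, dtype=float), (N, 1))        # rows are the local copies theta_t^i
    prev_local = np.stack([grads[i](theta[i]) for i in range(N)])   # grad f_i at the previous averaged iterate
    grad_track = prev_local.copy()                                  # tracked gradients nabla_t^i F
    for t in range(1, num_iter + 1):
        theta_bar = W @ theta                                       # (10a): one GAC round on the parameters
        local = np.stack([grads[i](theta_bar[i]) for i in range(N)])
        grad_track = W @ (grad_track - prev_local + local)          # (10a): GAC round with gradient tracking
        prev_local = local
        gamma = t ** (-alpha)
        for i in range(N):                                          # (10b): local FW step at each agent
            a_i = lo_oracle(grad_track[i])
            theta[i] = theta_bar[i] + gamma * (a_i - theta_bar[i])
    return theta.mean(axis=0)                                       # average iterate bar-theta_t
```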

3 Sketch of Proof for Theorem 1

We only sketch the proof of (5) and (6) in Theorem 1; the proof of (7) is postponed to the appendix. Let us define ρ̄ := max_{θ,θ′∈C} ‖θ − θ′‖ as the diameter of C. Now, as F_t is L-smooth and using H2, we have

F_t(θ_{t+1}) ≤ F_t(θ_t) + γ_t ⟨∇F_t(θ_t), â_t − θ_t⟩ + γ_t² (Lρ̄²/2)
           ≤ F_t(θ_t) + γ_t ⟨∇F_t(θ_t), a_t − θ_t⟩ + γ_t C_g · t^{−η} + γ_t² (Lρ̄²/2)
           = F_t(θ_t) + t^{−α} ( C_g · t^{−η} + (Lρ̄²/2) · t^{−α} − max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩ ) .   (12)

To obtain (5), we sum both sides of (12) from t = T/2+1 to t = T to yield

∑_{t=T/2+1}^{T} t^{−α} ⟨∇F_t(θ_t), θ_t − a_t⟩ ≤ ∑_{t=T/2+1}^{T} t^{−α} ( C_g · t^{−η} + (Lρ̄²/2) · t^{−α} ) + ∑_{t=T/2+1}^{T} ( F_t(θ_t) − F_t(θ_{t+1}) ) .   (13)

As ⟨∇F_t(θ_t), θ_t − a_t⟩ is non-negative for all t, the left hand side can be lower bounded by Ω(T^{1−α}) · min_{t∈[T/2+1,T]} max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩. Meanwhile, the first summation on the right can be upper bounded by O(T^{1−min{2α, α+η}}), and the second summation on the right can be bounded as

F_{T/2+1}(θ_{T/2+1}) − F_{T/2+1}(θ_{T/2+2}) + F_{T/2+2}(θ_{T/2+2}) − ⋯ − F_{T−1}(θ_T) + F_T(θ_T) − F_T(θ_{T+1})
  ≤ F_{T/2+1}(θ_{T/2+1}) − F_T(θ_{T+1}) + ∑_{t=T/2+1}^{T−1} C_b · t^{−β} = O(T^{1−min{1,β}}) .   (14)

We notice that the above results hold because the summation is taken from t = T/2+1 to t = T instead of from t = 1 to t = T. Thus, the right hand side of (13) can be bounded by O(T^{1−min{1, β, 2α, α+η}}); dividing by Ω(T^{1−α}) yields (5). The results in (6) can be observed directly from (12). In particular, when statement (a) in (6) is violated, i.e., C_g · t^{−η} + (Lρ̄²/2) · t^{−α} < max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩ for all t ∈ [T/2+1, T], then F_t(θ_{t+1}) < F_t(θ_t) for all such t and statement (b) holds. Otherwise, statement (a) holds whenever statement (b) is violated.

4 Numerical Experiment

To illustrate our analytical findings, we consider a non-convex matrix completion problem where the observations are contaminated with sparse noise. This is related to the so-called sparse+low rank matrix completion formulation [13]. Let σ > 0, R > 0 be fixed parameters; we consider:

min_{θ∈R^{m1×m2}} ∑_{(k,l)∈Ω} ( 1 − exp( −(Y_{k,l} − [θ]_{k,l})² / σ ) )  s.t.  ‖θ‖_⋆ ≤ R ,   (15)


where Y_{k,l} is the noisy observation of the (k, l)th entry of the matrix to be estimated, Ω ⊆ [m1] × [m2] is the set of observed entries' locations, and ‖·‖_⋆ is the nuclear norm; we promote a low rank solution via the nuclear norm constraint in (15). The negated Gaussian loss gives better tolerance to outlier entries than the standard square loss.
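For (15), the LO step of the FW algorithm is inexpensive: by the duality between the nuclear norm and the operator norm, arg min_{‖a‖_⋆ ≤ R} ⟨G, a⟩ = −R u_1 v_1^T, where (u_1, v_1) is the leading singular pair of the gradient G (a standard fact). Below is a minimal sketch of the standard FW iteration for (15); this is our own illustrative code (variable names and the use of scipy's `svds` are our choices), and the O-FW/DeFW variants only change how the gradient is formed.

```python
import numpy as np
from scipy.sparse.linalg import svds

def grad_negated_gaussian(theta, Y, mask, sigma):
    """Gradient of the negated Gaussian loss in (15); mask is 1 on Omega and 0 elsewhere."""
    r = (Y - theta) * mask
    return -(2.0 / sigma) * r * np.exp(-r ** 2 / sigma) * mask

def lo_nuclear_ball(G, R):
    """LO step over {theta : ||theta||_* <= R}: -R times the top singular pair of G."""
    u, s, vt = svds(G, k=1)                      # leading singular pair of the gradient
    return -R * np.outer(u[:, 0], vt[0, :])

def fw_matrix_completion(Y, mask, R=1e4, sigma=1.0, num_iter=500, alpha=0.75):
    """Sketch of the standard FW algorithm on (15); theta_0 = 0 is feasible and every iterate
    stays in the nuclear-norm ball as a convex combination of feasible points."""
    theta = np.zeros_like(Y, dtype=float)
    for t in range(1, num_iter + 1):
        G = grad_negated_gaussian(theta, Y, mask, sigma)
        a_t = lo_nuclear_ball(G, R)
        gamma = t ** (-alpha)
        theta = theta + gamma * (a_t - theta)
    return theta
```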

We consider the movielens100k dataset, which contains 10^5 records of movie ratings from m1 = 943 users on m2 = 1682 movies. We assign 8 × 10^4 (resp. 2 × 10^4) records for training (resp. testing). The standard FW, O-FW and DeFW algorithms are tested in the experiment. Specifically, the O-FW algorithm takes a batch of B = 20 new records at each round, and the DeFW algorithm is simulated on a network with N = 50 agents connected via an Erdos-Renyi graph with connectivity 0.1. The weight matrix W for the DeFW algorithm is designed using the Metropolis-Hastings rule. We consider the cases where ℓ = 1 and ℓ = 3 GAC information exchange rounds are performed per iteration in the DeFW algorithm. In addition, we also test the implementations with a standard square loss. We set the step size to γ_t = 2/(t+1) for the convex square loss, γ_t = t^{−α} with α = 0.75 for the non-convex negated Gaussian loss, and we set R = 10^4.

Fig. 1. Convergence of FW algorithms on the movielens100k dataset. (Left) noiseless observations. (Right) sparse-noise contaminated observations (20% sparsity). The same legend is used for the right figure.

The numerical results are presented in Fig. 1, where we show the test mean square error (MSE) and the FW gap versus the iteration/round number. Notice that the O-FW algorithm is terminated after 4000 rounds since we only have 8 × 10^4 training records. As seen, for the non-convex loss, the FW gaps of the tested algorithms decrease with the iteration number at an order of ∼ O(1/t^{0.75}), and the non-convex formulation achieves better MSE on the sparse-noise contaminated data than the standard square loss formulation. The observed convergence rate corroborates our analysis in Theorem 1.

Conclusions. This paper presents a unified analysis for FW/FW-like algorithms on non-convex optimization. Open problems include a more in-depth investigation of the accelerated convergence rate observed in (6). We also remark that, with a slight modification, our analysis can be applied to the away-step algorithm in [3] and its online variant.

References

[1] M. Jaggi, "Revisiting Frank-Wolfe: Projection-free sparse convex optimization," in ICML, 2013.
[2] S. Lacoste-Julien and M. Jaggi, "An affine invariant linear convergence analysis for Frank-Wolfe algorithms," NIPS, 2013.
[3] S. Lacoste-Julien and M. Jaggi, "On the global linear convergence of Frank-Wolfe optimization variants," in NIPS, 2015.
[4] E. Hazan and S. Kale, "Projection-free online learning," ICML, 2012.
[5] E. Hazan and H. Luo, "Variance-reduced and projection-free stochastic optimization," in ICML, 2016.
[6] S. Lacoste-Julien, "Convergence rate of Frank-Wolfe for non-convex objectives," CoRR, July 2016.
[7] Y. Yu, X. Zhang, and D. Schuurmans, "Generalized conditional gradient for sparse estimation," CoRR, Oct 2014.
[8] S. J. Reddi, S. Sra, B. Poczos, and A. Smola, "Stochastic Frank-Wolfe methods for nonconvex optimization," CoRR, July 2016.
[9] S. Ghadimi and G. Lan, "Accelerated gradient methods for nonconvex nonlinear and stochastic programming," Mathematical Programming, vol. 156, no. 1, pp. 59-99, Feb 2015.
[10] H.-T. Wai, J. Lafond, A. Scaglione, and E. Moulines, "Decentralized projection-free optimization for convex and non-convex problems," in preparation, 2016.
[11] J. Tsitsiklis, "Problems in decentralized decision making and computation," Ph.D. dissertation, Dept. of Electrical Engineering and Computer Science, M.I.T., Boston, MA, 1984.
[12] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, "Randomized gossip algorithms," IEEE Trans. Inf. Theory, vol. 52, no. 6, pp. 2508-2530, Jun. 2006.
[13] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, "Sparse and low-rank matrix decompositions," in Proc. 47th Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2009, pp. 962-967.
[14] E. A. Nirminskii, "Convergence conditions for nonlinear programming algorithms," Cybernetics, no. 6, pp. 79-81, Nov 1972.


A Proof of Theorem 1

We supply details to the proof sketch in Section 3 by identifying the non-asymptotic constants. Let us begin from (13). As γ_t = t^{−α}, we can lower bound the left hand side by:

∑_{t=T/2+1}^{T} t^{−α} ⟨∇F_t(θ_t), θ_t − a_t⟩ ≥ ( ∑_{t=T/2+1}^{T} t^{−α} ) · min_{t∈[T/2+1,T]} ⟨∇F_t(θ_t), θ_t − a_t⟩
  ≥ (T^{1−α} / (1−α)) ( 1 − (2/3)^{1−α} ) · min_{t∈[T/2+1,T]} ⟨∇F_t(θ_t), θ_t − a_t⟩ ,   (16)

for all T ≥ 6. On the other hand, the right hand side of (13) can be bounded as

∑_{t=T/2+1}^{T} t^{−α} ( C_g · t^{−η} + (Lρ̄²/2) · t^{−α} ) ≤ ( C_g + (Lρ̄²/2) ) · ∑_{t=T/2+1}^{T} t^{−min{2α, η+α}} ,   (17)

and, using (14),

∑_{t=T/2+1}^{T} ( F_t(θ_t) − F_t(θ_{t+1}) ) ≤ F_{T/2+1}(θ_{T/2+1}) − F_T(θ_{T+1}) + ∑_{t=T/2+1}^{T−1} C_b · t^{−β} .   (18)

For any 1 > δ > 0, the latter summations in the above can be bounded as

∑_{t=T/2+1}^{T} t^{−δ} ≤ ∫_{T/2}^{T} t^{−δ} dt ≤ (T^{1−δ} / (1−δ)) ( 1 − (1/2)^{1−δ} ) .   (19)

For δ ≥ 1, the summation is bounded by ∑_{t=T/2+1}^{T} t^{−δ} ≤ ∫_{T/2}^{T} t^{−δ} dt ≤ log 2. Therefore, we can write the upper bound as

∑_{t=T/2+1}^{T} t^{−δ} ≤ T^{max{0, 1−δ}} · C(δ), where C(δ) := { (1 − (1/2)^{1−δ}) / (1−δ) if 0 < δ < 1 ;  log 2 if δ ≥ 1 } .   (20)
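As a quick numerical sanity check of (20) (not part of the original argument), one may compare the partial sum against T^{max{0,1−δ}} · C(δ); the snippet below is purely illustrative.

```python
import numpy as np

def C_delta(delta):
    """The constant C(delta) defined in (20)."""
    return (1 - 0.5 ** (1 - delta)) / (1 - delta) if delta < 1 else np.log(2)

def check_bound(T, delta):
    # left hand side: partial sum from T/2+1 to T; right hand side: the bound in (20)
    lhs = sum(t ** (-delta) for t in range(T // 2 + 1, T + 1))
    rhs = T ** max(0.0, 1 - delta) * C_delta(delta)
    return lhs, rhs, lhs <= rhs

for T in (10, 100, 1000):
    for delta in (0.3, 0.8, 1.0, 1.5):
        print(T, delta, check_bound(T, delta))
```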

Notice that the bound is decreasing with δ. Consequently, since the objective values are bounded we have F_{T/2+1}(θ_{T/2+1}) − F_T(θ_{T+1}) ≤ 2B, and therefore

min_{t∈[T/2+1,T]} max_{θ∈C} ⟨∇F_t(θ_t), θ_t − θ⟩ ≤ ( 1 − (2/3)^{1−α} )^{−1} · ( 2B + ( C_g + C_b + Lρ̄²/2 ) · C(min{2α, η+α, β}) ) · T^{−min{1−α, η, β−α}} .   (21)

We now prove the third statement in the Theorem. Define the following set of stationary points of (1):

C^⋆ := { θ̄ ∈ C : max_{θ∈C} ⟨∇F(θ̄), θ̄ − θ⟩ = 0 } .   (22)

Theorem 2. [14, Theorem 1] Consider an arbitrary convergent subsequence {θ_{s_t}}_{t≥1} in C with limit point θ. If the following conditions hold:

A1. it holds that lim_{t→∞} ‖θ_{t+1} − θ_t‖ = 0 ,

A2. if θ ∉ C^⋆, then there exists ε_0 > 0 such that for all 0 < ε ≤ ε_0, the integer quantity τ_t is finite, where

τ_t := min_{s > s_t} s  s.t.  ‖θ_s − θ_{s_t}‖ > ε ,   (23)

A3. taking the same τ_t defined above, there exists a continuous function W(θ) that takes a finite number of values on C^⋆ such that

lim sup_{t→∞} W(θ_{τ_t}) < lim_{t→∞} W(θ_{s_t}) ,   (24)

then the sequence {W(θ_t)}_{t≥1} converges and the limit points of the sequence {θ_t}_{t≥1} belong to the set C^⋆.

Our plan is to apply the theorem above to prove (7). We first observe that, as C is closed and bounded, by the Bolzano-Weierstrass theorem there exists a convergent subsequence {θ_{s_t}}_{t≥1} of the sequence of iterates generated by the generalized FW algorithm (4). Moreover, condition A1 can be easily verified since

‖θ_{t+1} − θ_t‖ ≤ γ_t ‖â_t − θ_t‖ ≤ γ_t ρ̄ ,   (25)

and γ_t → 0 as t → ∞. Now, let θ be the limit of the subsequence {θ_{s_t}}_{t≥1} and suppose θ ∉ C^⋆. We shall verify condition A2 in Theorem 2 by contradiction. In particular, we assume that the following holds for all 0 < ε ≤ ε_0:

‖θ_s − θ_{s_t}‖ ≤ ε , ∀ s > s_t .   (26)

For some sufficiently large t and s > s_t, since {θ_{s_t}}_{t≥1} converges to θ as t → ∞, we have θ_s ∈ B_{2ε}(θ), i.e., the ball of radius 2ε centered at θ. Furthermore, as θ ∉ C^⋆ and C^⋆ is closed, the following holds:

min_{θ′∈C} ⟨∇F(θ_s), θ′ − θ_s⟩ ≤ −δ < 0 , ∀ s > s_t ,   (27)

for some δ > 0. In particular, we have ⟨∇F(θ_s), a_s − θ_s⟩ ≤ −δ. From (12), H1 and H2, it holds for all t ≥ 1 that:

F(θ_{t+1}) − F(θ_t) ≤ γ_t · ⟨∇F(θ_t), a_t − θ_t⟩ + γ_t C_g · t^{−η} + (1/2) γ_t² Lρ̄² .   (28)

To arrive at a contradiction, we let s > s_t and sum both sides of the above from t = s_t to t = s. Consider the following chain of inequalities:

F(θ_s) − F(θ_{s_t}) ≤ ∑_{ℓ=s_t}^{s} γ_ℓ · ( ⟨∇F(θ_ℓ), a_ℓ − θ_ℓ⟩ + C_g · ℓ^{−η} + (Lρ̄²/2) · ℓ^{−α} )
  ≤ −δ ∑_{ℓ=s_t}^{s} γ_ℓ + ∑_{ℓ=s_t}^{s} ℓ^{−α} ( C_g · ℓ^{−η} + (Lρ̄²/2) · ℓ^{−α} ) ,   (29)

where the second inequality is due to (27). Letting s → ∞ and observing that ∑_{ℓ=s_t}^{s} γ_ℓ → +∞ implies

lim_{s→∞} F(θ_s) − F(θ_{s_t}) = −∞ ,   (30)

since lim_{s→∞} ∑_{ℓ=s_t}^{s} ℓ^{−α} ( C_g · ℓ^{−η} + (Lρ̄²/2) · ℓ^{−α} ) < ∞, which is due to η + α > 1 and 2α > 1. This leads to a contradiction since F(θ) is bounded over C. We conclude that condition A2 holds for the FW algorithm.

The remaining task is to verify condition A3. We shall take W(·) = F(·). By the definition of τ_t, we have θ_s ∈ B_ε(θ_{s_t}) for all s_t ≤ s ≤ τ_t − 1. Again, for some sufficiently large t, we have θ_s ∈ B_ε(θ_{s_t}) ⊆ B_{2ε}(θ) and the inequality (29) holds for s = τ_t − 1. This gives:

F(θ_{τ_t}) − F(θ_{s_t}) ≤ ∑_{ℓ=s_t}^{τ_t−1} γ_ℓ · ( −δ + C_g · ℓ^{−η} + (Lρ̄²/2) · ℓ^{−α} ) .   (31)

On the other hand, we have θ_{τ_t} ∉ B_ε(θ_{s_t}) and thus

ε < ‖θ_{τ_t} − θ_{s_t}‖ ≤ ∑_{ℓ=s_t}^{τ_t−1} γ_ℓ ‖â_ℓ − θ_ℓ‖ ≤ ρ̄ ∑_{ℓ=s_t}^{τ_t−1} γ_ℓ .   (32)

The above implies that ∑_{ℓ=s_t}^{τ_t−1} γ_ℓ > ε/ρ̄ > 0. Considering (31) again, observe that the latter two terms decay to zero; hence, for some sufficiently large t, we have −δ + O(ℓ^{−min{η,α}}) ≤ −δ′ < 0 if ℓ ≥ s_t. Therefore, (31) leads to

F(θ_{τ_t}) − F(θ_{s_t}) ≤ −δ′ ∑_{ℓ=s_t}^{τ_t−1} γ_ℓ < −δ′ ε / ρ̄ < 0 .   (33)

Taking the limit t → ∞ on both sides leads to (24) and completes the proof.
