Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step∗

Yair Carmon Department of Electrical Engineering Stanford University Stanford, CA 94305, USA [email protected]

John Duchi Department of Statistics Department of Electrical Engineering Stanford University Stanford, CA 94305, USA [email protected]

Abstract

We consider the minimization of non-convex quadratic forms regularized by a cubic term, which exhibit multiple saddle points and poor local minima. Nonetheless, we prove that, under mild assumptions, gradient descent approximates the global minimum to within ε accuracy in O(ε^{-1} log(1/ε)) steps for large ε and O(log(1/ε)) steps for small ε (compared to a condition number we define), with at most logarithmic dependence on the problem dimension.

1 Introduction

We study the optimization problem

minimize_{x ∈ R^d} f(x) := (1/2) x^T A x + b^T x + (ρ/3)‖x‖^3,    (1)

where the matrix A is symmetric and possibly indefinite. The problem (1) arises in Newton's method with cubic regularization, first proposed by Nesterov and Polyak [14]. The method consists of the iterative procedure

x_{t+1} = argmin_{x ∈ R^d} { ∇g(x_t)^T (x − x_t) + (1/2)(x − x_t)^T ∇²g(x_t)(x − x_t) + (ρ/3)‖x − x_t‖^3 }    (2)

for (approximately) minimizing a general smooth function g, requiring sequential solutions of problems of the form (1). The Nesterov-Polyak scheme (2) falls into the broader framework of trust-region methods [5, 3]. Such methods are among the most practically successful and theoretically sound approaches to non-convex optimization [5, 14, 3]. In particular, as Nesterov and Polyak first showed, under certain smoothness assumptions on g, it is possible to establish a rate at which ∇²g converges to a positive semidefinite matrix.

Standard methods for solving the problem (1) exactly require either factorization or inversion of the matrix A. However, the cost of these operations scales poorly with the problem dimensionality; in very large scale problems (such as, for example, training deep neural networks) even computing all the entries of the Hessian may be infeasible. Moreover, in methods that rely on matrix inversion or factorization it is often difficult to exploit matrix structure such as sparsity. In contrast, matrix-free methods, which access A only through matrix-vector products, often scale well to high dimensions and leverage structure in A (cf. [18]), particularly when A is a Hessian. Even without special structure, the product ∇²g(x)v often admits the finite-difference approximation δ^{-1}(∇g(x + δv) − ∇g(x)), which requires only two gradient evaluations. In neural networks and other arithmetic circuits, back-propagation-like methods allow exact computation of Hessian-vector products at a

* This is an extended abstract. A full version containing all the proofs is available online on arXiv.

similar cost [16, 17]. It is thus of practical and theoretical interest to explore matrix-free methods guaranteed to solve (1) efficiently.

In this paper, we prove that gradient descent, perhaps the simplest matrix-free method, efficiently converges to the global minimum of the ("substantially non-convex" [14]) cubic problem (1), and we provide empirical evidence supporting our theoretical guarantees. More precisely, we show that gradient descent solves problem (1) to accuracy ε in at most O(1/ε) iterations, and that it exhibits linear convergence when ε is small with respect to problem conditioning. Our result implies that when gradient descent is used to approximate (2), the rate of convergence of ∇²g is maintained, thus demonstrating a first-order method with second-order convergence; this is discussed in detail in the full version of the paper.

A number of researchers have considered low-complexity methods for solving the problem (1). Cartis et al. [3] propose solution methods working in small Krylov subspaces, and Bianconcini et al. [2] apply a matrix-free gradient method (NMGRAD) in conjunction with an early stopping criterion for the problem. Both approaches exhibit strong practical performance, and they enjoy first-order convergence guarantees for the overall optimization method (in which problem (1) is an iteratively solved sub-problem). In both works, however, it appears challenging to give convergence guarantees for the iterative subproblem solvers, and they do not provide the second-order convergence rates of Nesterov and Polyak's Newton method (2). In their paper [4], Cartis et al. give sufficient conditions for a low-complexity approximate subproblem solution to guarantee such second-order rates, but it is not clear how to construct a first-order method fulfilling these conditions.
Closely related and intensely studied is the quadratic trust-region problem [5, 8, 9, 6], where one replaces the regularizer (ρ/3)‖x‖^3 with the constraint ‖x‖ ≤ R for some R > 0. Classical low-complexity methods for this problem include subspace methods and the heuristic Steihaug-Toint truncated conjugate gradient method; we know of no convergence rates in either case. Recently, fast matrix-free approaches were proposed for the trust-region problem [10, 11] and, concurrently to this work, for the problem (1) [1]. These approaches reduce the problems to short sequences of approximate eigenvector computations and convex optimization problems, thus obtaining accelerated rates of O(1/√ε), which are better than those we achieve when ε is large relative to problem conditioning. However, while these works indicate that solving (1) is never harder than approximating the bottom eigenvector of A, the regime of linear convergence we identify shows that it is sometimes much easier. In addition, we believe that our results provide interesting insights into the potential of gradient descent and other direct methods for non-convex problems.

Another related line of work is the study of the behavior of gradient descent around saddle points and its ability to escape them [7, 12, 15, 13]. A common theme in these works is an "exponential growth" mechanism that pushes the gradient descent iterates away from critical points with negative curvature, similar to the amplification of large eigen-directions in the classical power method. This mechanism plays a prominent role in our analysis as well, highlighting the implications of negative curvature for the dynamics of gradient descent.
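To make the matrix-free access model concrete, the following Python/NumPy sketch implements the finite-difference Hessian-vector approximation ∇²g(x)v ≈ δ^{-1}(∇g(x + δv) − ∇g(x)) discussed in the introduction. The quartic test function g and the step δ = 10^{-6} are our own illustrative choices, not from the paper:

```python
import numpy as np

def grad_g(x):
    # Gradient of the illustrative function g(x) = (1/4) sum_i x_i^4 - (1/2)||x||^2,
    # whose Hessian is the diagonal matrix diag(3 x_i^2 - 1).
    return x**3 - x

def hessian_vector_product(grad, x, v, delta=1e-6):
    # Matrix-free approximation of the Hessian-vector product using
    # only two gradient evaluations: H(x) v ~= (grad(x + delta*v) - grad(x)) / delta.
    return (grad(x + delta * v) - grad(x)) / delta

x = np.array([1.0, -0.5, 2.0])
v = np.array([0.3, 1.0, -0.7])
approx = hessian_vector_product(grad_g, x, v)
exact = (3 * x**2 - 1) * v  # exact Hessian-vector product for this particular g
```

For this smooth g the discretization error is of order δ, so the approximation agrees with the exact product to several digits.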

2 Preliminaries and basic convergence guarantees

Before continuing, we provide some notation for our approach to problem (1), where ρ > 0, b ∈ R^d, A ∈ R^{d×d} is symmetric and possibly indefinite, and ‖·‖ denotes the Euclidean norm. The eigenvalues of A are λ^(1)(A) ≤ λ^(2)(A) ≤ ··· ≤ λ^(d)(A), where any of the eigenvalues λ^(i)(A) may be negative. We define the eigengap of A by gap := λ^(k)(A) − λ^(1)(A), where λ^(k)(A) is the first eigenvalue of A strictly larger than λ^(1)(A). For any vector v, v^(i) denotes the ith coordinate of v in the eigen-basis of A. We let ‖·‖_op be the standard ℓ2-operator norm, so ‖A‖_op = max_{u:‖u‖=1} ‖Au‖, and define γ := −λ^(1)(A), γ₊ := γ ∨ 0, and β := ‖A‖_op = max{|λ^(1)(A)|, |λ^(d)(A)|}, so that the function f is non-convex if and only if γ > 0 (and is convex when γ₊ = 0). We remark that our results continue to hold when β is an upper bound on ‖A‖_op rather than its exact value.

We let s denote a solution to problem (1), i.e. a global minimizer of f, which has the characterization [14, Section 5] as the solution to the equality and inequality

∇f(s) = (A + ρ‖s‖I)s + b = 0  and  ρ‖s‖ ≥ γ,    (3)

and is unique whenever ρ‖s‖ > γ. The global minimizer admits the following equivalent characterization whenever the vector b is not orthogonal to the eigenspace associated with λ^(1)(A).

Claim 1. If b^(1) ≠ 0, then s is the unique point that satisfies ∇f(s) = 0 and b^(1) s^(1) ≤ 0.

Additionally, the norm of s admits the upper bound

‖s‖ ≤ γ/(2ρ) + sqrt( (γ/(2ρ))² + ‖b‖/ρ ) ≤ β/(2ρ) + sqrt( (β/(2ρ))² + ‖b‖/ρ ) =: R,    (4)

and the lower bound

‖s‖ ≥ R_c := −(b^T A b)/(2ρ‖b‖²) + sqrt( ( (b^T A b)/(2ρ‖b‖²) )² + ‖b‖/ρ ) ≥ −β/(2ρ) + sqrt( (β/(2ρ))² + ‖b‖/ρ ).    (5)

The gradient descent method begins at some initialization x₀ ∈ R^d and generates iterates x₁, x₂, ... according to

x_{t+1} = x_t − η∇f(x_t) = (I − ηA − ρη‖x_t‖I)x_t − ηb.    (6)

Throughout our analysis we make the following assumptions about the step size η and the initialization x₀.

Assumption A. The step size η in (6) satisfies 0 < η ≤ 1/(4(β + ρR)).

Assumption B. The initialization of (6) satisfies x₀ = −r b/‖b‖, with 0 ≤ r ≤ R_c.

A key to our analysis is the fact that ‖x_t‖ is monotonic.

Lemma 1. Let Assumptions A and B hold. Then for all t ≥ 0, the iterates (6) of gradient descent satisfy x_t^T ∇f(x_t) ≤ 0, and the norms ‖x_t‖ are non-decreasing and satisfy ‖x_t‖ ≤ R.

Assumption B guarantees that b^(1) x₀^(1) ≤ 0, while Assumption A and the norm bound in Lemma 1 together guarantee that b^(1) x_t^(1) ≤ c_t b^(1) x_{t−1}^(1) for some c_t ≥ 0, and thus b^(1) x_t^(1) ≤ 0 for every t by induction. By standard arguments, limit points of gradient descent are critical points. Hence, when b^(1) ≠ 0, every partial limit of gradient descent satisfies Claim 1 and is therefore the unique global minimum s, which implies x_t → s as t → ∞. This also allows us to strengthen the norm bound of Lemma 1 to ‖x_t‖ ≤ ‖s‖ for every t.
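The iteration (6), with η and x₀ chosen per Assumptions A and B, takes a few lines of NumPy. In the sketch below (the indefinite test instance and iteration count are our own choices), β and R_c are computed by dense eigendecomposition purely for convenience; a genuinely matrix-free implementation would use upper bounds, which the paper notes suffice. The final checks correspond to the optimality characterization (3) and the norm monotonicity of Lemma 1:

```python
import numpy as np

def cubic_reg_gd(A, b, rho, T):
    # Gradient descent (6) on f(x) = 0.5 x'Ax + b'x + (rho/3)||x||^3.
    evals = np.linalg.eigvalsh(A)
    beta = max(abs(evals[0]), abs(evals[-1]))            # beta = ||A||_op
    nb = np.linalg.norm(b)
    R = beta / (2 * rho) + np.sqrt((beta / (2 * rho)) ** 2 + nb / rho)   # eq. (4)
    c = (b @ (A @ b)) / (2 * rho * nb ** 2)
    Rc = -c + np.sqrt(c ** 2 + nb / rho)                 # eq. (5)
    eta = 1.0 / (4 * (beta + rho * R))                   # Assumption A
    x = -Rc * b / nb                                     # Assumption B with r = R_c
    norms = [np.linalg.norm(x)]
    for _ in range(T):
        x = x - eta * (A @ x + b + rho * np.linalg.norm(x) * x)   # iteration (6)
        norms.append(np.linalg.norm(x))
    return x, np.array(norms)

# A small indefinite test instance (our choice): eigenvalues in [-1, 2], so gamma = 1.
rng = np.random.default_rng(0)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(np.linspace(-1.0, 2.0, d)) @ Q.T
b = rng.standard_normal(d)
b *= 10.0 / np.linalg.norm(b)
rho = 1.0

x, norms = cubic_reg_gd(A, b, rho, T=5000)
gradient = A @ x + b + rho * np.linalg.norm(x) * x   # should vanish, per (3)
```

On this well-conditioned instance the iterates contract geometrically, so after 5000 steps the gradient is negligible, ρ‖x‖ exceeds γ as (3) requires, and the recorded norms are non-decreasing as Lemma 1 predicts.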

3 Main result: non-asymptotic convergence rates

Theorem 1. Let Assumptions A and B hold, b^(1) ≠ 0, and ε > 0. Then f(x_t) ≤ f(s) + ε for all

t ≥ T_ε := ( τ_grow(b^(1)) + τ_converge(ε) ) · (1/η) · min{ 1/(ρ‖s‖ − γ), 10‖s‖²/ε, sqrt( 10‖s‖²/(gap′ ε) ) },    (7)

where gap′ := (gap ∧ ρ‖s‖) · 1{ε ≥ 10‖s‖²(ρ‖s‖ − γ)} (when the indicator is zero, the last term in the minimum is excluded), and

τ_grow(b^(1)) := 6 log( 1 + γ₊²/(4ρ|b^(1)|) )  and  τ_converge(ε) := 6 log( (β + 2ρ‖s‖)‖s‖²/ε ).

Theorem 1 shows that the rate of convergence changes from roughly O(1/ε) to O(log(1/ε)) as ε decreases, with an intermediate gap-dependent rate of O(1/√ε). The constants τ_grow and τ_converge correspond to a period (τ_grow) in which ‖x_t‖ grows exponentially until reaching the basin of attraction of the global minimum, and a period (τ_converge) of linear convergence to s. The term τ_grow = 0 when the problem is convex (as then γ₊ = 0), so we see that the period of exponential growth is directly related to the negative curvature in f.

The dependence of our result on |b^(1)| (and the implicit assumption b^(1) ≠ 0) can be eliminated by adding to b a small perturbation uniformly distributed on a sphere of radius ρε/(12(β + 2ρ‖s‖)), and applying gradient descent on the modified problem instance. As we show in the full version of the paper, this guarantees that, with probability at least 1 − δ, gradient descent finds an ε-suboptimal solution (to the unperturbed problem) in a number of iterations at most a constant times T_ε defined in (7), with |b^(1)| replaced by ρ‖s‖²δ/(4√d).

Figure 1 depicts the number of gradient steps required to find a point x satisfying f(x) − f(s) ≤ ε(f(0) − f(s)) as a function of 1/ε, for random problem instances with a small value of ρ‖s‖ − γ. The slopes in this log-log plot reveal good agreement between theory and experiment, and suggest that there exist instances for which our upper bounds are tight up to sub-polynomial factors.

Figure 1. The purple curve is shaded according to the cdf of the number of iterations required to reach relative accuracy ε, computed over 2,500 random problem instances, each with d = 10^4, β = ρ = 1, γ = 0.5, gap = 5·10^{-3} and ρ‖s‖ − γ = 5·10^{-4}. We use x₀ = −R_c b/‖b‖ and η = 0.25. The black, red and blue dashed curves indicate the three convergence regimes Theorem 1 identifies.
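The characterization (3) also yields a simple dense-algebra baseline to test gradient descent against: for θ > γ the point s(θ) = −(A + θI)^{-1} b has ρ‖s(θ)‖ strictly decreasing in θ, so bisecting for the θ with ρ‖s(θ)‖ = θ recovers the global minimizer. The sketch below (test instance, tolerances, and iteration counts are our own choices; it assumes b^(1) ≠ 0, which holds almost surely for random b) compares the objective values of the two solutions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
evals = np.linspace(-1.0, 2.0, d)         # eigenvalues of A; gamma = 1
A = Q @ np.diag(evals) @ Q.T
b = rng.standard_normal(d)
b *= 10.0 / np.linalg.norm(b)
rho = 1.0
gamma = -evals[0]

def f(x):
    return 0.5 * x @ A @ x + b @ x + (rho / 3) * np.linalg.norm(x) ** 3

bt = Q.T @ b
def s_of(theta):
    # s(theta) = -(A + theta*I)^{-1} b, computed in the eigenbasis of A.
    return Q @ (-bt / (evals + theta))

# Bisection: rho*||s(theta)|| - theta is strictly decreasing for theta > gamma.
lo = max(gamma, 0.0) + 1e-12
hi = max(gamma, 0.0) + np.sqrt(rho * np.linalg.norm(b)) + 1.0  # ||s(hi)|| < hi here
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if rho * np.linalg.norm(s_of(mid)) > mid:
        lo = mid
    else:
        hi = mid
s = s_of(0.5 * (lo + hi))    # global minimizer via characterization (3)

# Gradient descent (6) under Assumptions A and B.
beta = max(abs(evals[0]), abs(evals[-1]))
nb = np.linalg.norm(b)
R = beta / (2 * rho) + np.sqrt((beta / (2 * rho)) ** 2 + nb / rho)
c = (b @ (A @ b)) / (2 * rho * nb ** 2)
Rc = -c + np.sqrt(c ** 2 + nb / rho)
eta = 1.0 / (4 * (beta + rho * R))
x = -Rc * b / nb
for _ in range(5000):
    x = x - eta * (A @ x + b + rho * np.linalg.norm(x) * x)
```

On this instance ρ‖s‖ − γ is large, so gradient descent is in its fast regime and f(x) matches the globally optimal value f(s) to high accuracy.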

4 Proof outline

We provide a brief overview of the proof of the linear convergence part of Theorem 1. A comprehensive treatment of the proof appears in the full version of this paper. We tacitly let Assumptions A and B hold throughout the section. Our first result has the structure of a linear convergence guarantee.

Lemma 2. For each t > 0, we have

‖x_t − s‖² ≤ ( 1 − η( ρ‖x_t‖ − (γ − (ρ‖s‖ − γ)/2) ) ) ‖x_{t−1} − s‖² − ηρ (‖s‖ − ‖x_{t−1}‖)² ‖s‖.

The above recursion implies geometric decrease in ‖x_t − s‖ only when ρ‖x_t‖ is larger than γ − (1/2)(ρ‖s‖ − γ), which may be non-trivial for non-convex problem instances (with γ > 0). Using the fact that ‖x_t‖ is non-decreasing (Lemma 1), Lemma 2 immediately implies the following.

Lemma 3. If ρ‖x_t‖ ≥ γ − (1/2)(ρ‖s‖ − γ) + δ for some t ≥ 0, then for all τ ≥ 0,

‖x_{t+τ} − s‖² ≤ (1 − ηδ)^τ ‖x_t − s‖² ≤ 2‖s‖² e^{−ηδτ}.

It remains to show that ρ‖x_t‖ will quickly exceed any level below γ. Fortunately, as long as ρ‖x_t‖ is below γ − δ, |x_t^(1)| grows faster than (1 + ηδ)^t, which the next lemma leverages.

Lemma 4. Let δ > 0. Then ρ‖x_t‖ ≥ γ − δ for all t ≥ (2/(ηδ)) log( 1 + γ₊²/(4ρ|b^(1)|) ).

To give the linear convergence regime of Theorem 1, we first apply Lemma 4 with δ = (1/3)(ρ‖s‖ − γ), which yields ρ‖x_t‖ ≥ γ − (1/3)(ρ‖s‖ − γ) for every t ≥ τ_grow(b^(1))/(η(ρ‖s‖ − γ)). We may then apply Lemma 3 and standard smoothness arguments to obtain f(x_t) ≤ f(s) + ε after τ_converge(ε)/(η(ρ‖s‖ − γ)) additional gradient descent iterations.
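The exponential-growth mechanism behind Lemma 4 is easy to observe numerically. Expanding iteration (6) along the bottom eigenvector gives x_{t+1}^(1) = (1 + η(γ − ρ‖x_t‖)) x_t^(1) − η b^(1); since b^(1) x_t^(1) ≤ 0, the component |x_t^(1)| is multiplied by at least (1 + ηδ) per step while ρ‖x_t‖ ≤ γ − δ. The sketch below (instance parameters and the choice δ = 0.1 are ours; ‖b‖ is taken small so the growth phase is non-trivial) verifies this step by step:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
evals = np.linspace(-1.0, 2.0, d)       # eigenvalues of A; gamma = 1
A = Q @ np.diag(evals) @ Q.T
v1 = Q[:, 0]                            # bottom eigenvector of A
b = rng.standard_normal(d)
b *= 0.1 / np.linalg.norm(b)            # small ||b||: long exponential-growth phase
rho, gamma, delta = 1.0, 1.0, 0.1

beta = 2.0                              # ||A||_op for this spectrum
nb = np.linalg.norm(b)
R = beta / (2 * rho) + np.sqrt((beta / (2 * rho)) ** 2 + nb / rho)
c = (b @ (A @ b)) / (2 * rho * nb ** 2)
Rc = -c + np.sqrt(c ** 2 + nb / rho)
eta = 1.0 / (4 * (beta + rho * R))      # Assumption A

r = min(Rc, 0.1)                        # Assumption B allows any 0 <= r <= R_c
x = -r * b / nb
steps_in_phase = 0
while rho * np.linalg.norm(x) < gamma - delta:
    x1_old = abs(v1 @ x)
    x = x - eta * (A @ x + b + rho * np.linalg.norm(x) * x)   # iteration (6)
    # Geometric growth of the bottom-eigenvector component during the phase.
    assert abs(v1 @ x) >= (1 + eta * delta) * x1_old - 1e-12
    steps_in_phase += 1
    assert steps_in_phase < 100000      # the phase must end, per Lemma 4
```

The loop exits once ρ‖x_t‖ reaches γ − δ, matching the picture in which the iterates first escape the negative-curvature region exponentially fast and only then contract linearly toward s.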

References

[1] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv:1611.01146 [math.OC], 2016.
[2] T. Bianconcini, G. Liuzzi, B. Morini, and M. Sciandrone. On the use of iterative methods in cubic regularization for unconstrained optimization. Computational Optimization and Applications, 60(1):35–57, 2015.
[3] C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, Series A, 127:245–295, 2011.
[4] C. Cartis, N. I. Gould, and P. L. Toint. Complexity bounds for second-order optimality in unconstrained optimization. Journal of Complexity, 28(1):93–108, 2012.
[5] A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. MPS-SIAM Series on Optimization. SIAM, 2000.
[6] J. B. Erway and P. E. Gill. A subspace minimization method for the trust-region step. SIAM Journal on Optimization, 20(3):1439–1461, 2009.
[7] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. arXiv:1503.02101, 2015.
[8] N. I. M. Gould, S. Lucidi, M. Roma, and P. L. Toint. Solving the trust-region subproblem using the Lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.
[9] N. I. M. Gould, D. P. Robinson, and H. S. Thorne. On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation, 2(1):21–57, 2010.
[10] E. Hazan and T. Koren. A linear-time algorithm for trust region problems. Mathematical Programming, Series A, 158(1):363–381, 2016.
[11] N. Ho-Nguyen and F. Kılınç-Karzan. A second-order cone based approach for solving the trust-region subproblem and its variants. arXiv:1603.03366 [math.OC], 2016.
[12] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent converges to minimizers. arXiv:1602.04915 [stat.ML], 2016.
[13] K. Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv:1611.04831, 2016.
[14] Y. Nesterov and B. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, Series A, 108:177–205, 2006.
[15] I. Panageas and G. Piliouras. Gradient descent converges to minimizers: The case of non-isolated critical points. arXiv:1605.00405, 2016.
[16] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
[17] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
[18] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. SIAM, 1997.
