Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions

Ioannis Panageas
MIT-SUTD (previously Georgia Tech)

joint work with Georgios Piliouras (SUTD)

Outline

- Basics
- Results and Proof sketch
- Examples
- Previous and Future work

Definitions

Problem
Let f : R^N → R be C^2. Solve

    min_{x ∈ R^N} f(x).

Typical way: Gradient Descent (GD)

    x_{k+1} = x_k − α ∇f(x_k),    (1)

with constant step size α > 0. This defines a discrete dynamical system x_{k+1} = g(x_k), where g(x) = x − α ∇f(x).

Question: Great, but any guarantees?

- Answer: If ∇f is L-Lipschitz and α ≤ 1/L, then GD converges to the set of fixed points.
- Folklore: f(x_k) − f(x_{k+1}) ≥ (1/(2L)) ‖∇f(x_k)‖_2^2.
- ⇒ f is decreasing along the trajectory ⇒ set-wise convergence (not point-wise!).
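As a quick illustration (my own sketch, not from the slides), the Python snippet below iterates (1) on a toy quadratic and numerically checks the folklore descent inequality; the objective `f` and its gradient `grad_f` are placeholder choices with L = 1.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters=1000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k); return the trajectory."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(iters):
        xs.append(xs[-1] - alpha * grad_f(xs[-1]))
    return xs

# Placeholder toy objective: f(x) = 0.5 * ||x||^2, so grad f(x) = x and L = 1.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
L = 1.0
alpha = 1.0 / L

xs = gradient_descent(grad_f, x0=[3.0, -2.0], alpha=alpha, iters=50)

# Folklore bound: f(x_k) - f(x_{k+1}) >= (1/(2L)) * ||grad f(x_k)||_2^2.
for xk, xk1 in zip(xs[:-1], xs[1:]):
    assert f(xk) - f(xk1) >= (1 / (2 * L)) * np.dot(grad_f(xk), grad_f(xk)) - 1e-12
```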

Definitions (cont.)

Question: What if f is non-convex?

- Answer: The best we can hope for is convergence to a local minimum!
- And we will have it... (under "mild" assumptions).

Important definitions

- x* is a critical point of f if ∇f(x*) = 0 (there can be uncountably many!).
- A critical point x* is isolated if there is a neighborhood U around x* such that x* is the only critical point in U.
- x* is a saddle point if for all neighborhoods U around x* there are y, z ∈ U such that f(z) ≤ f(x*) ≤ f(y).
- A critical point x* of f is a strict saddle if λ_min(∇²f(x*)) < 0 (see the sketch after this list).
- A set S is called forward (or positively) invariant w.r.t. h : E → R^N with S ⊆ E ⊆ R^N if h(S) ⊆ S.
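To make the strict-saddle test concrete, here is a minimal sketch (my own illustration; `is_strict_saddle` is a hypothetical helper, not from the slides) that classifies a critical point by the smallest eigenvalue of the Hessian:

```python
import numpy as np

def is_strict_saddle(hessian, x_star, tol=1e-9):
    """Hypothetical helper: x* is a strict saddle if lambda_min(Hessian(x*)) < 0."""
    eigvals = np.linalg.eigvalsh(hessian(x_star))  # symmetric => real, ascending order
    return eigvals[0] < -tol

# Example: f(x, y) = x^2 - y^2 has a critical point at the origin, Hessian diag(2, -2).
hess = lambda p: np.array([[2.0, 0.0], [0.0, -2.0]])
print(is_strict_saddle(hess, np.zeros(2)))  # True: lambda_min = -2 < 0
```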

Previous work and our results

Theorem (Lee, Simchowitz, Jordan, Recht '16)
Let f : R^N → R be a C² function, let ∇f be globally L-Lipschitz, and let x* be a strict saddle. Assume 0 < α < 1/L; then Pr(lim_k x_k = x*) = 0.

If the strict saddle points are isolated, then GD converges to saddle points with probability zero.

Theorem (Main)
Let f : S → R be C² in an open convex set S ⊆ R^N with sup_{x ∈ S} ‖∇²f(x)‖_2 ≤ L < ∞. If g(S) ⊆ S, then the set of initial conditions x ∈ S from which gradient descent with 0 < α < 1/L converges to a strict saddle point is of (Lebesgue) measure zero, without the assumption that critical points are isolated.

Remarks and proof steps

Corollary
Assume furthermore that lim_k x_k exists, and let ν be a prior measure (with support S) which is absolutely continuous w.r.t. the Lebesgue measure. Then with probability 1, GD converges to local minima.

Remarks
The Lee et al. result is generalized in two ways:

- No global Lipschitz condition.
- Critical points do not have to be isolated.

Proof steps

- 1. Convergence: Show that GD converges (done already).
- 2. Diffeomorphism: Prove that g is a diffeomorphism on S (eigenvalue analysis; show the Jacobian is invertible — see the sketch after this list).
- 3. Measure zero: Use the center-stable manifold theorem along with Lindelöf's lemma.
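The eigenvalue analysis behind step 2 is short enough to spell out; a sketch under the theorem's assumptions (‖∇²f(x)‖_2 ≤ L on S and 0 < α < 1/L):

```latex
% Jacobian of the GD map g(x) = x - \alpha \nabla f(x):
Dg(x) = I - \alpha \nabla^2 f(x).
% Its eigenvalues are 1 - \alpha \lambda_i, where the \lambda_i are the (real)
% eigenvalues of the symmetric matrix \nabla^2 f(x). Since |\lambda_i| \le L
% and 0 < \alpha < 1/L:
1 - \alpha \lambda_i \;\ge\; 1 - \alpha L \;>\; 0 \quad\Rightarrow\quad Dg(x) \text{ is invertible.}
% Injectivity of g on the convex set S follows from
\|g(x) - g(y)\| \;\ge\; \|x - y\| - \alpha \|\nabla f(x) - \nabla f(y)\| \;\ge\; (1 - \alpha L)\,\|x - y\| > 0,
% so g is a C^1 diffeomorphism onto its image.
```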

Why are these technicalities important?

- Manifold: a topological space that "looks like" Euclidean space near each point.
- Diffeomorphism: a map between manifolds which is continuously differentiable and has a continuously differentiable inverse. This is a useful technical smoothness condition that allows us to apply standard theorems about dynamical systems (e.g., the Center-Stable Manifold theorem).

Center-Stable Manifold theorem

Center-Stable Manifold theorem (informal)
If the rule of the dynamics is a diffeomorphism, then for every fixed point p there exists an open ball B_p such that if a trajectory q(n) stays inside B_p for all n ≥ 0, then q(0) belongs to a (local) center-stable manifold W^{cs}(p), whose dimension equals the dimension of the space spanned by the eigenvectors of the Jacobian (at p) with eigenvalues of absolute value ≤ 1.
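Combining this with the strict-saddle definition gives the dimension count used in step 3 (a short derivation consistent with the slides' definitions):

```latex
% At a strict saddle x^*, \lambda_{\min}(\nabla^2 f(x^*)) < 0, so
% Dg(x^*) = I - \alpha \nabla^2 f(x^*) has the eigenvalue
1 - \alpha \, \lambda_{\min}(\nabla^2 f(x^*)) \;>\; 1,
% i.e., at least one eigenvalue of absolute value > 1. The center-stable manifold
% W^{cs}(x^*) is spanned by the remaining eigendirections (absolute value \le 1),
% hence \dim W^{cs}(x^*) \le N - 1.
```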

Step 3. Measure zero

Proof sketch of Step 3

- Every strict saddle p has a (local) center-stable manifold W^{cs}(p) of dimension at most N − 1, hence of measure zero in R^N.
- Consider the union of all the balls B_p and pick a countable subcover (Lindelöf's lemma: every open cover in R^k has a countable subcover).
- g^{−1} is C¹ and maps null sets to null sets, so the set of points whose trajectory eventually stays in some B_p (and thus converges to a strict saddle) has measure zero.
- A countable union of measure-zero sets has measure zero.

Examples - Non-isolated critical points

f(x, y, z) = 2xy + 2xz − 2x − y − z ⇒ ∇f(x, y, z) = (2y + 2z − 2, 2x − 1, 2x − 1).

- The critical points form the line (1/2, w, 1 − w) for w ∈ R, and all of them are strict saddles (the minimum eigenvalue of the Hessian is −2√2).

Therefore... the set of initial conditions in R³ from which GD converges to this line has measure zero.
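A quick numerical check of this example (my own sketch, not part of the slides): the Hessian's smallest eigenvalue is indeed −2√2, and a GD trajectory started at a random point moves away from the saddle line (here f is unbounded below, so generic trajectories escape rather than converge).

```python
import numpy as np

grad = lambda p: np.array([2*p[1] + 2*p[2] - 2, 2*p[0] - 1, 2*p[0] - 1])

# The Hessian is constant; its minimum eigenvalue is -2*sqrt(2) < 0 (strict saddle).
H = np.array([[0.0, 2.0, 2.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(np.linalg.eigvalsh(H)[0])  # ~ -2.828

alpha = 0.1  # 0 < alpha < 1/L with L = ||H||_2 = 2*sqrt(2) ~ 2.83
rng = np.random.default_rng(0)
p = rng.standard_normal(3)
for _ in range(100):
    p = p - alpha * grad(p)

# Distance from p to the line {(1/2, w, 1 - w) : w in R} -- it grows: the line repels.
dist = np.sqrt((p[0] - 0.5)**2 + (p[1] + p[2] - 1)**2 / 2)
print(dist)
```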

Examples (cont.) - Forward invariant set

f(x, y) = x²/2 + y⁴/4 − y²/2, with Hessian

    ∇²f(x, y) = [ 1      0
                  0  3y² − 1 ].

- f is not globally Lipschitz (the Lee et al. result does not apply!).
- The critical points are (0, 0), (0, 1), (0, −1).
- For S = (−1, 1) × (−2, 2), sup_{(x,y) ∈ S} ‖∇²f(x, y)‖_2 ≤ 11 (the maximum is attained at y = 2).
- Choose α = 1/12 < 1/11; then g(x, y) = (11x/12, 13y/12 − y³/12) ⇒ g(S) ⊆ S.

[Figure: contour plot of f on S, showing the critical points (0, 0), (0, 1), (0, −1).]

Therefore... the set of initial conditions in S from which GD converges to (0, 0) has measure zero. Starting at random, GD converges to (0, 1) or (0, −1) with probability 1.
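A matching sketch for this example (again my own illustration, not from the slides): iterating the closed-form GD map with α = 1/12 from random starts in S, trajectories settle near (0, 1) or (0, −1) and, with probability 1, never at the saddle (0, 0).

```python
import numpy as np

# GD map for f(x, y) = x^2/2 + y^4/4 - y^2/2 with step alpha = 1/12,
# i.e., g(x, y) = (11x/12, 13y/12 - y^3/12).
g = lambda p: np.array([11*p[0]/12, 13*p[1]/12 - p[1]**3/12])

rng = np.random.default_rng(1)
counts = {"(0, 1)": 0, "(0, -1)": 0, "(0, 0)": 0}
for _ in range(200):
    p = np.array([rng.uniform(-1, 1), rng.uniform(-2, 2)])  # random start in S
    for _ in range(1000):
        p = g(p)                      # stays in S, since g(S) is a subset of S
    if p[1] > 0.5:
        counts["(0, 1)"] += 1
    elif p[1] < -0.5:
        counts["(0, -1)"] += 1
    else:
        counts["(0, 0)"] += 1
print(counts)  # roughly half/half between (0, 1) and (0, -1); none at the saddle
```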

Previous and Future work

- Vector flows perturbed by noise cannot converge to unstable fixed points [Pemantle '90].
- Other dynamics? Results for replicator dynamics (evolution, game theory) [Mehta, P., Piliouras '15].
- Mirror descent (with a strongly convex mirror map). Ongoing work [Lee, P., Simchowitz, Jordan, Piliouras, Recht '16].
- Non-negative matrix factorization (NMF)? Ongoing work [P., Piliouras, Tetali] analyzing the Lee and Seung dynamics.
- Quantitative versions (under stronger assumptions) [Ge, Huang, Jin, Yuan '15].
- Many more...

Thank you!

Postdoc positions open!
Where: Singapore
On what: Game Theory, Algorithms, Dynamical Systems
[email protected]
