Comparison-Based Natural Gradient Optimization in ...

Viewer
Transcript

Comparison-Based Natural Gradient Optimization in High Dimension Youhei Akimoto

Faculty of Engineering, Shinshu University

[email protected]

Anne Auger

INRIA, Research centre Saclay – Île-de-France

[email protected]

ABSTRACT We propose a novel natural gradient based stochastic search algorithm, VD-CMA, for the optimization of high dimensional numerical functions. The algorithm is comparisonbased and hence invariant to monotonic transformations of the objective function. It adapts a multivariate normal distribution with a restricted covariance matrix with twice the dimension as degrees of freedom, representing an arbitrarily oriented long axis and additional axis-parallel scaling. We derive the different components of the algorithm and show linear internal time and space complexity. We find empirically that the algorithm adapts its covariance matrix to the inverse Hessian on convex-quadratic functions with an Hessian with one short axis and different scaling on the diagonal. We then evaluate VD-CMA on test functions and compare it to different methods. On functions covered by the internal model of VD-CMA and on the Rosenbrock function, VD-CMA outperforms CMA-ES (having quadratic internal time and space complexity) not only in internal complexity but also in number of function calls with increasing dimension.

Categories and Subject Descriptors G.1.6 [Numerical Analysis]: Optimization—Global optimization, Gradient methods, Unconstrained optimization; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems

Keywords Covariance Matrix Adaptation, Natural Gradient, Hessian Matrix, Information Geometric Optimization, Theory

1.

INTRODUCTION

Natural gradients, popularized in the context of machine learning by Amari [3], have been applied with success in various contexts like training of multilayer perceptron [4], variational inference [13], reinforcement learning [15, 16] or Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. GECCO’14, July 12–16, 2014, Vancouver, BC, Canada. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2662-9/14/07 ...$15.00. http://dx.doi.org/10.1145/2576768.2598258.

Nikolaus Hansen

INRIA, Research centre Saclay – Île-de-France

[email protected]

black-box optimization [14, 21]. In the setting of black-box optimization, given a family of probability distributions Pθ , a stochastic search algorithm is defined by an iterative update on Pθ . (This update should lead to convergence of Pθ towards a Dirac-delta distribution concentrated on optima of f .) The natural gradient can be elegantly used to define this update. Indeed, the original minimization problem of finding arg minx∈Rd f (x) can be transformed into a joint optimization problem defined on Θ equipped with the Fisher metric. The joint problem can simply be minimization of the expectation of f over Pθ [7,21]. However this latter criterion is not invariant under monotonic transformations of f and can easily lead to unstable behavior. In order to achieve an invariant algorithm, a better choiceR for the joint problem is arg maxθ∈Θ J(θ) with J(θ) := x∈Rd W ◦ f (x) dPθ , where W ◦ f is a θ-dependent rank-preserving transformation of f . The joint problem J is strictly comparison-based (or solely based on the f -ranks) and therefore invariant to monotonic transformations of f (see below for the precise definition) [5]. The natural gradient of J(θ), estimated by Monte Carlo samples from Pθ , governs the update of θ. In contrast to some settings in machine learning where it is required to estimate the Fisher information matrix (FIM) and perform a numerical inversion in order to compute the natural gradient (e.g. for multi-layer perceptrons, see [4] for a discussion), in our context the inverse FIM can be explicitly derived provided a statistical model with known FIM is used. Consequently, efficient comparison-based optimization algorithms using natural gradients can be derived [1,5,7]. Interestingly, the covariance matrix adaptation evolution strategy (CMA-ES) [9, 10, 20]—a state-of-the-art evolutionary algorithm for continuous optimization—derives from the natural gradient framework sketched above when using the family of Gaussian distributions [1, 5]. The CMAES algorithm was however introduced independently of the natural gradient approach. The CMA-ES parametrizes a Gaussian model with full covariance matrix, i.e. θ encodes a mean vector and full covariance matrix. It achieves invariance to general linear transformation of the search space. However in consequence, its space complexity and its internal time complexity per f -call are quadratic (see details below). For optimizing functions in higher dimensions, quadratic scaling becomes quickly too time consuming and linear scaling is desirable. To achieve linear scaling, several “variants” of CMA-ES have been proposed compromising on the general invariance to linear transformation: sep-CMA [18] restricts C to a diagonal matrix, MVA [17] and R1-NES [19] parameterize C by

I + vv T . We propose here a novel comparison-based algorithm with linear time and space complexity derived from the natural gradient. Instead of a full covariance matrix, we parametrize the covariance matrix as D(I + vv T )D where D is a diagonal matrix of size d and v a vector in Rd . The parameter update follows the natural gradient and two further techniques from CMA-ES are borrowed: an evolution path that low pass filters the change of the distribution mean (additionally used for the natural gradient update) and cumulative step-size adaptation, based on a similar idea. Both mechanisms enhance the performance of CMA-ES considerably [8]. The reminder of the paper is organized as follows. Section 2 is devoted to the introduction to the IGO framework and the CMA-ES. In section 3, we derive the novel linear time and space algorithm from the IGO framework combined with the cumulation concept borrowed from the CMA-ES. The proposed method is called VD-CMA. In section 4, we compare VD-CMA with the CMA-ES on a standard benchmark testbed, both in terms of function evaluations and cpu time. It is also compared with other linear time variants of evolution strategies. We conclude this paper with summary and further discussion in Section 5.

2.

INFORMATION GEOMETRIC OPTIMIZATION AND CMA-ES

Information Geometric Optimization (IGO) is a general framework for optimization in arbitrary search spaces. IGO is based on invariance and consequently leads to comparison and natural gradient based optimization algorithms [5]. We give now some background about IGO, explain how parts of the CMA-ES algorithm are instantiations of IGO on the family of Gaussian distributions and detail other important concepts of CMA-ES that we borrow to construct our novel algorithm. While the IGO framework applies to arbitrary search spaces, we describe it conveniently on Rd .

2.1 Information Geometric Optimization Given an objective function to be minimized, f : Rd 7→ R, and a family of probability distributions, Pθ on Rd with θ ∈ Θ, equipped with the Fisher metric, a joint optimization problem is defined on Θ as the maximization of the expectation J(θ) of a nonlinear scaling W of the objective function f over Pθ . More specifically, we consider at a given iteration t the current parameter θt and define Wθft (x) := w qθft (f (x)) , where w : [0, 1] → R is a non-increasing function and qθft (f¯) is the probability of sampling a point from the current distribution given θt into the sub R level set {x ∈ Rd : f (x) ≤ f¯}, defined as qθft (f¯) := f (x)≤f¯ pθt (x) dx. Hence, the function J also depends on θR (denoted by Jθt in the sequel) and is defined as Jθt (θ) := Rd Wθft (x)pθ (x) dx. This q-quantile based transformation of f is invariant under monotonic transformations of f and leads to comparisonbased algorithms that show the same performance on f as on any increasing transformation of f , e.g., f (x) = akxkb +c is equivalent for all a, b > 0 and c ∈ R. The IGO update consists in the gradient ascent step θ

t+δt

˜ θt (θ)|θ=θt = θ + δt∇J t

(1)

˜ θt (θ) is the natural gradient of Jθt given by the where ∇J product of the inverse of the FIM1 I −1 (θ) and the vanilla gradient ∇Jθt (θ). The gradient ∇Jθt (θ) is computed with the log-likelihood trick R (2) ∇Jθt (θ) = Rd Wθft (x) ∇ ln pθ (x) pθ (x) dx .

The update (1) requires to evaluate the integral (2). This integral is naturally estimated by a Monte-Carlo method with λ samples drawn from the current distribution Pθt . Given a set of λ independent samples xi ∼ Pθt , for i = 1, . . . , λ, the quantile function qθft (f (xi )) is approximated by (rk(xi )+1/2)/λ, with rk(xi ) the ranking of f (xi ) among the λ samples, namely, rk(xi ) := |{j : f (xj ) ≤ f (xi )}|. Hence Wθft (xi ) is approximated by wrk(xi ) := w((rk(xi ) + 1/2)/λ). With wi := w((i + 1/2)/λ)/λ, the IGO algorithm update reads P θt+δt − θt = δtI −1 (θt ) λi=1 wrk(xi ) ∇ ln pθt (xi ) P = δtI −1 (θt ) λi=1 wi ∇ ln pθt (xi:λ ) , (3) where xi:λ denotes the ith best point ranked according to f .

2.2 The CMA-ES Algorithm The CMA-ES algorithm [9, 10, 20], considered as state-ofthe-art method for stochastic numerical optimization, was recently found to derive from the natural gradient, and more precisely from the IGO framework described in the previous section [1, 5]. For CMA-ES, Pθ is the family of Gaussian distributions N (m, σ 2 C) parametrized with a mean vector m ∈ Rd and a covariance matrix σ 2 C, with σ ∈ R> the so-called global step-size, and C ∈ Rd×d a positive-definite symmetric matrix. Because Gaussian distributions are considered, the FIM and its inverse are well known and the IGO update (1) can be computed analytically [1]. The update of θ = (m, σ, C) in CMA-ES combines different ideas. The update of m and C uses the natural gradient as prescribed by the IGO algorithm (3), however also using the so-called cumulation concept that smoothens and accelerates this update without compromising its stability [8]. Then, different learning rates (corresponding to the step-size δt of the gradient ascent step) for the mean and the covariance matrix updates are used. Last, the global step-size σ is independently adapted in order to accelerate the search performance and prevent premature convergence. We give in the sequel a compact but thorough definition of the CMA-ES algorithm. After the initialization of m, σ and C = I and so-called evolution paths pσ = pC = 0 ∈ Rd , the CMA-ES repeats the following steps until a termination criterion is satisfied. Step Compute the square √ √ 1. Matrix Decomposition. root C of C, where √ √C is symmetric and positive-definite, and satisfies C = C C. Step 2. Sampling, Evaluation and Ranking. Sample λ candidate solutions xi ∼ N (m, σ 2 C), for i ∈ J1, λK, as follows. Generate d-variate standard √ normal random vectors zi ∼ N (0, I) and compute xi = m+σ Czi . For the later use we keep {zi } as well as {xi }. Then, evaluate their objective values f (xi ) for all i ∈ J1, λK. Rank the solutions according to f . In the following steps the subscript i : λ denotes the index of ith best solution among λ current samples. 1 RThe Fisher information Tmatrix, FIM, is defined as I(θ) = ∇ ln pθ (x)(∇ ln pθ (x)) pθ (x) dx. Rd

Step 3. Cumulation. Update the evolution paths pσ and pC as p P pσ ← (1 − cσ )pσ + cσ (2 − cσ )µeff µi=1 wi zi:λ √ P h c (2−c )µ pC ← (1 − cc )pC + σ c σ c eff µi=1 wi (xi:λ − m).

Here cσ and cc are the inverses of the backward Pµ time2 horizons −1 for pσ and pC , respectively, and µeff = . We i=1 wi 2 2t let hσ = 1 if kpσ k /d < (2 + 4/(d + 1))(1 − (1 − cσ ) ), where t is the iteration number starting from one, and hσ = 0 otherwise. Step 4. Update Parameters. Compute the natural gradient and update the parameters as follows: P ˜ m ln pθ (xi:λ ), m ← m + cm µi=1 wi ∇ σ ← σ exp ((cσ /dσ ) (kpσ k/χd − 1)) , C ← C + (1 − hσ )c1 cc (2 − cc )C | {z } make up for variance loss in case of hσ = 1 Pµ ˜ C ln pθ (m + σpC ) . ˜ C ln pθ (xi:λ ) +c1 ∇ + cµ i=1 wi ∇ | {z } {z } | rank-one update

rank-(µ∧d) update

Here cm is the learning rate for the m update, cµ and c1 are the learning rates of the so-called rank-µ and rank-one updates for C. The damping parameter for the σ update is denoted by dσ . √ The symbol χd denotes the √ expected value of kN (0, I)k = 2Γ((d + 1)/2)/Γ(d/2) ≈ d(1 − 1/(4d) + 1/(21d2 )). The approximated value is used in the algorithm. ˜ = (∇ ˜ m, ∇ ˜ C ) of the log-likelihood of The natural gradient ∇ pθ w.r.t. (m, C) is computed while σ is considered to be fixed. When σ is fixed, the FIM I(θ), where the parameter vector is θ = [mT , vec(C)T ]T , becomes a block diagonal matrix diag(Im , IC ) [2]. The diagonal blocks of I are given by Im = σ −2 C −1 and IC = 2−1 (C −1 ⊗ C −1 ), where ⊗ denotes the Kronecker product. The vanilla gradients of the log-likelihood w.r.t. m and C are respectively ∇m ln pθ (x) = σ −2 C −1 (x − m) and ∇C ln pθ (x) = 2−1 C −1 σ −2 (x − m)(x − m)T − C C −1 . Multiplying the inverse of Im and IC with the gradient ∇m ln pθ (x) and vec ∇C ln pθ (x) , we have the ˜ m ln pθ (x) = (x − m) and ∇ ˜ C ln pθ (x) = natural gradient ∇ σ −2 (x − m)(x − m)T − C. Therefore, the update equations for m and C read P C ← C + cµ µi=1 wi σ −2 (xi:λ − m)(xi:λ − m)T − C + c1 pC pT C −C , P m ← m + cm µi=1 wi (xi:λ − m) .

The resulting m- and C-updates are similar to those used in the cross-entropy method (CEM) for continuous optimization [6] if σ = 1 and c1 = 0, except that in CEM the m is updated first, which invariably leads to smaller variances in C and aggravates the problem of premature convergence. The constants appearing in the algorithm are summarized in the following [8]. λ = 4 + ⌊3 ln(d)⌋, wi =

µ = ⌊λ/2⌋,

ln((λ+1)/2)−ln(i) Pµ , i=1 (ln((λ+1)/2)−ln(i))

dσ = 1 + cσ + 2 max(0, cc =

4+µeff /d , d+4+2µeff /d

cµ = min 1 − c1 ,

cσ =

q

c1 =

µeff −1 d+1

cm = 1, µeff +2 , d+µeff +5

− 1),

2 (d+1.3)2 +µeff

2(µeff −2+1/µeff ) (d+2)2 +µeff

, .

(4)

The internal time complexity is O(d3 ) for Step 1, O(λd2 ) for Step 2, O(µd) for Step 3 and O(µd2 ) for Step 4. In practice, the matrix decomposition (Step 1) is done every ⌈(10d(c1 +cµ ))−1 ⌉ iterations reducing the internal time complexity to O(d2 ) per f -call. The space complexity is O(d2 + µd).

3. VD-CMA: A LINEAR VARIANT OF CMAES FOR HIGH DIMENSION OPTIMIZATION We derive in this section a novel comparison-based algorithm using the IGO framework and additional features of CMA-ES. The algorithm aims at optimizing functions in high dimensions and should thus scale linearly with dimension d for its internal time (per f -call) and memory requirements. For this purpose we restrict the covariance matrix of the Gaussian model {N (m, σ 2 C)} on Rd such that C has only 2d components to be adapted. More specifically, C is written in the form C = D(I + vv T )D ,

(5)

where D is a diagonal matrix of dimension d and v is a vector in Rd . This model is able to represent a scaling for each variable by D and a principle component, which is generally not parallel to an axis, by Dv. We parameterize the model by θ = (m ∈ Rd , σ ∈ R, θC ∈ R2d ) where θC is composed of two parts: θD ∈ Rd whose ith element is the ith diagonal element of D, and v ∈ Rd .

3.1 Preliminaries To derive the parameter update equation for the model based on the natural gradient, we first derive the gradient of the log-likelihood ln pθ (x) and the FIM of the model. When computing the gradient and the FIM, we suppose that σ is fixed and the parameter vector considered is θ = T T [mT , v T , θD ] . Notations. Let V be the diagonal matrix whose ith diagonal element is the ithe component of v. The normalization of v by its Euclidean norm is denoted by v = v/kvk. Analogously, V = V /kvk. The determinant of I + vv T , namely 1+kvk2 , is denoted γv . Let ⊙ denote the element-wise product operator and v = v ⊙ v. The vector whose elements are all one is denoted by 1 for any dimension. The vector ei is the unit vector whose ith component is one and the others are zero. Lemma 3.1. Let x ∈ Rd and let y = σ −1 D−1 (x − m). The gradients of the log-likelihood of our model w.r.t. m, v and θD are ∇m ln pθ (x) = σ −2 C −1 (x − m), ∇θD ln pθ (x) = D−1 y ⊙ y − γv−1 hy, vi y ⊙ v − 1 ∇v ln pθ (x) = γv−1 hy, vi y − γv−1 (hy, vi2 + γv )v .

Proof idea. It is known from Eq. (20) in [2] that the partial derivative of the log-likelihood of the Gaussian model {N (m, Σ)} parameterized by θ given x w.r.t. θi is computed as ∂ ln pθ (x) ∂θi

= +

∂mT ∂θi 1 2

Σ−1 (x − m)

∂Σ −1 Tr Σ−1 ∂θ Σ (x − m)(x − m)T − Σ , i

where Tr denotes the trace. Substituting the partial derivatives and simplifying the above equality, we have the gradients. The details are omitted due to the space limitation. Lemma 3.2. The Fisher information matrix of our model −2 −1 is a block diagonal matrix h i diag(Im , IC ), where Im = σ C Iv,v Iv,D T and IC = ID,v ID,D , where Iv,D = ID,v and Iv,v = γv−1 kvk2 I + (1 − kvk2 )γv−1 vv T , ID,v = γv−1 D−1 V (2 + kvk2 )I − vv T , ID,D = γv−1 D−1 2γv I + kvk2 V 2 − V vv T V D−1 .

Proof idea. It is a well-know fact that the ijth element [I]i,j of the FIM for the Gaussian distribution N (m, Σ) with T ∂Σ −1 ∂Σ Σ−1 ∂m + 12 Tr Σ−1 ∂θ Σ ∂θj . parameter θ is [I]i,j = ∂m ∂θi ∂θj i

T T Remember θ = [mT , v T , θD ] and Σ = σ 2 C = σ 2 D(I + vv T )D in our case. The partial derivative ∂m/∂θi = ei if θi = mi and ∂m/∂θi = 0 otherwise. On the other hand, ∂Σ/∂mi = 0. Therefore, the first term in the above equality appears only if both θi and θj are the elements of m, and the second term appears only if neither θi or θj is the element of m. This proves the block diagonal property of I and Im = Σ−1 = σ −2 C −1 is an immediate consequence. The rests are derived from the second term by substituting the partial derivatives w.r.t. v and θD and simplifying it.

To compute the natural gradient, the FIM must be invert−1 −1 ible. Since I is block diagonal, the inverse is diag(Im , IC ) if both Im and IC are nonsingular. By Lemma 3.2 we know that Im = (σ 2 C)−1 and it is nonsingular as long as C is nonsingular. The partitioned matrix inversion formula described in Theorem 8.5.11 in [12] shows that IC is nonsingular if and only if the Schur complement Sv,v = ID,D − −1 ID,v Iv,v Iv,D of Iv,v is nonsingular, where the invertibility of Iv,v is guaranteed by the Sherman-Morrison formula (e.g., Corollary 18.2.10 in [12]) as long as v 6= 0. The Schur complement Sv,v is given in the following lemma, whose proof is straight forward from the partitioned matrix inversion formula, Sherman-Morrison formula and Lemma 3.2. Lemma 3.3. The Schur complement of Iv,v in IC ap2 T peared in Lemma 3.2 is Sv,v = 2D−1 [I − 2V + vv ]D−1 . Unfortunately, Sv,v becomes singular for some v. For example, v = ei for any i is a typical case that causes the singularity of Sv,v . Moreover, Sv,v becomes singular for any v (hence for any v) in the case of d = 2. Indeed, noting that kvk2 = 1, it is easy to see the determinant of Sv,v is zero. Thus, the invertibility of IC is not guaranteed. Theoretically one can use the pseudo inverse of the FIM to define the natural gradient everywhere. However, due to arbitrarily small eigenvalues of the FIM around singular points that lead to arbitrarily long natural gradients, the resulting parameter update becomes unstable when the parameter approaches a singular point. Therefore it does not essentially solve this difficulty.

the block diagonalized FIM is nonsingular since Iv,v and ID,D are nonsingular according to the Scherman-Morrison formula, provided v 6= 0 and D is positive definite. However, this leads to poor behavior, for example, on the rotated Cigar function defined in Table 1, where the principle component of the covariance matrix should not be axis parallel and v has to be learned adequately. To get nonsingularity without significantly compromising the performance, we use (α) diag(Im , IC ) in place of diag(Im , IC ), where Iv,v αIv,D (α) IC = , α ∈ [0, 1] . (6) αID,v ID,D (α)

(α)

If α = 1, IC is the original FIM, if α = 0, IC is block (α) diagonal, and α must be tuned such that IC remains nonsingular for any v and D, with α = 1 desired. Since Iv,v is nonsingular except for v = 0, according to the partitioned (α) matrix inversion formula IC is invertible if and only if the −1 Iv,D of Iv,v is Schur complement Sv,v = ID,D − α2 ID,v Iv,v nonsingular. Lemma 3.4. The Schur complement of Iv,v in (6) is Sv,v = T D−1 (A + bvv )D−1 , where b = −(1 − α2 )kvk4 γv−1 + 2α2 and 2 A = 2I − (b + 2α2 )V . The proof is straightforward from Lemma 3.2 and SchermanMorrison formula. Although the necessary and sufficient condition for Sv,v to be nonsingular is not provided, the next proposition shows a sufficient α which leads to numerically −1 stable Sv,v . −1/2

. If we choose Proposition 3.5. Let γ = γv 1/2 4 v / maxi (v i )] α = min 1, [kvk +(2−γ)γ , 2+kvk2

(7)

then Sv,v is nonsingular and its inverse is given by T T −1 Sv,v = D A−1 − (1 + bv A−1 v)−1 bA−1 vv A−1 D .

Proof idea. According to Lemma 3.4 and SchermanMorrison formula, Sv,v is invertible if A is nonsingular and T 1 + bv A−1 v 6= 0. Then, Scherman-Morrison formula pro−1 vides the above explicit form of Sv,v . Thus, it suffices to T show the nonsingularity of A and 1 + bv A−1 v 6= 0 under T condition (7). Indeed, we prove Ai,i ≥ γ and 1 + bv A−1 v ≥ γ. The necessary and sufficient condition for Ai,i ≥ γ is α2 ≤ [kvk4 +(2−γ)(1+kvk2 )/ max(v j )]/(2+kvk2 )2 . If this condiP P 2 T tion holds, we have 0 < v A−1 v ≤ γ −1 i v i ≤ γ −1 i v i = T γ −1 . Noting that γ ≤ 1, we have that 1+bv A−1 v ≥ γ holds if b ≥ γ(γ−1). The necessary and sufficient condition for b ≥ γ(γ −1) is α2 ≥ [kvk4 −(1−γ)γ(1+kvk2 )]/(kvk4 +2kvk2 +2). The α in (7) satisfies both inequalities.

3.2 Modified Fisher Information Matrix with Reduced Off-diagonal Blocks

The following theorem provides O(d) computation of the modified natural gradient of the log-likelihood with α introduced in Proposition 3.5.

A simple way to avoid singularity of the FIM is to restrict it to the principal diagonal blocks. Then, the natural gradient for each block of parameters is computed independently while the other parameters are fixed. In our case,

Theorem 3.6. The modified natural gradient is computed as follows. Let y = D−1 (x − m)/σ as above. Let α, A and b be those which appear in Proposition 3.5 and Lemma 3.4.

Table 1: Test function definitions. R is an orthogonal matrix and u is a unit vector, both are randomly i−1 generated for each run; Dell is a diagonal matrix whose ith diagonal element is 103 d−1 . The global minimum point is located at 1, RT 1 and 0 for fros , frosrot and all other functions, respectively. P P Sphere fsph (x) = di=1 x2i Tablet ftab (x) = 106 x21 + di=2 x2i 2 6 Pd 2 Ellipsoid fell (x) = fsph (Dell x) Cigar fcig (x) = x1 + 10 i=2 xi 6 Ellipsoid-Cigar fellcig (x) = fcigrot (Dell x) Rot-Tablet ftabrot (x) = fsph (x) + (10 − 1) hx, ui2 Rot-Ellipsoid fellrot (x) = fell (Rx) Rot-Cigar fcigrot (x) = 106 fsph (x) + (1 − 106 ) hx, ui2 Pd−1 2 2 Rot-Rosenbrock frosrot (x) = fros (Rx) Rosenbrock fros (x) = i=1 10 (xi − xi+1 )2 +(xi − 1)2 Compute s and t as follows. 1. 2. 3. 4. 5.

2

s ← y ⊙ y − kvk

−1

t ← hy, vi y − 2

hy, vi γv−1 y 2

⊙v−1

(hy, vi + γv )v

(2 + kvk2 )v ⊙ t − kvk2 hv, ti v

−1

s ← A−1 s − 1 + b v, A−1 v b s, A−1 v A−1 v

t ← t − α (2 + kvk2 )v ⊙ s − s, v v s←s−

αγv−1

˜ m ln pθ (x) = x − m, ∇ ˜ v ln pθ (x) = kvk−1 t and Then, ∇ ˜ ∇θD ln pθ (x) = Ds. Proof idea. The modified natural gradient is the prod(α) uct of the inverse of diag(Im , IC ) and the vanilla gra(α) dient given in Lemma 3.1. The inverse of diag(Im , IC ) (α) −1 is diag(Im , (IC )−1 ). As is shown in Lemma 3.2, Im = −1 σ −2 C −1 and its inverse is Im = σ 2 C. Premultiplying the −1 ˜ m ln pθ (x) = vanilla gradient w.r.t. m by Im , we have ∇ (α) x − m. The inverse of IC with α in Proposition 3.5 can be computed by using the partitioned matrix inversion formula as −1 Iv,v αIv,D αID,v ID,D −1 −1 −αIv,v Iv,D −1 I 0 −1 Sv,v −αID,v Iv,v I + = v,v I 0 0

−1 The inverse Iv,v can be computed by the Scherman-Morrison −1 formula and Sv,v is given in Proposition 3.5. Substituting them into the above equality and premultiplying the vanilla gradient given in Lemma 3.1 by the right hand side of the above equality, we obtain the natural gradient w.r.t. v and θD . The steps 2 to 7 are just a decomposition of the computation and can be computed in O(d).

3.3 The VD-CMA Algorithm The overall algorithm is as follows. Initialize m and σ depending on the problem search space. Initialize D = I and v ∼ N (0, I/d) and pC = pσ = 0. Step 1. Sampling, Evaluation and Ranking. Sample λ candidate solutions xi , for i ∈ J1, λK, as follows. Generate d-variate standard pnormal random vectors zi ∼ N (0, I), compute yi = zi + ( 1 + kvk2 − 1) hzi , vi v and xi = m + σDyi . Then, yi ∼ N (0, I + vv T ) and xi ∼ N (m, σ 2 D(I + vv T )D). Evaluate their objective values f (xi ) for all i ∈ J1, λK. Rank solutions according to f . Step 2. Cumulation. The evolution paths pσ and pC are updated in the same way as the original CMA does (see Sec. 2.2). Note that zi = C −1/2 (xi − m)/σ in the original CMA whereas zi = (I + vv T )−1/2 D−1 (xi − m)/σ 6=

C −1/2 (xi −m)/σ in our case. We use zi rather than C −1/2 (xi − m)/σ in order to achieve linear time update. Then, compute hσ in the same way as in the original CMA. Step 3. Update Parameters. Update the parameters as follows: P ˜ m ln pθ (xi:λ ), m ← m + cm µi=1 wi ∇ σ ← σ exp ((cσ /dσ ) (kpσ k/χd − 1)) , P ˜ v ln pθ (xi:λ ) v ← v + cµ µ wi ∇ i=1

˜ v ln pθ (m + σpC ), + hσ c1 ∇ Pµ ˜ D ln pθ (xi:λ ) D ← D + cµ i=1 wi ∇ ˜ + hσ c1 ∇D ln pθ (m + σpC ).

Here the natural gradients are computed following Theorem 3.6. The (constant) parameters are taken from (4) except for cσ , c1 and cµ . Since the degrees of freedom in the covariance matrix is 2d compared with d(d + 1)/2 in the original CMA, we expect that the natural gradient estimate is more reliable and larger values for c1 and cµ can be taken. The learning rate cσ for σ is modified as well to achieve better scale-up with d. Let cold and cold be the settings given 1 µ in (4). Then, we set √ µeff d−5 old cσ = 2(√d+√ , c1 = d−5 cold . 1 , cµ = min 1 − c1 , 6 cµ 6 µ ) eff

The internal space complexity decreases to O(µd), the internal time complexity for each objective function call to O(d) compared to O(µd + d2 ) and O(d2 ) in the standard CMA, respectively.

4. EXPERIMENTS We evaluated the new VD-CMA on benchmark functions described in Table 1, where the number of variables varies from 10 to 104 . The VD-CMA is comparison-based and has the same performance on any composition of the functions by a monotonic transformation. The initial mean vector obeys N (3 · 1, 22 · I), except for fros (x) and frosrot (x) where m obeys N (0, 22 · I) to avoid the symmetry; the initial stepsize is σ = 2. Runs are terminated as successful when a function value better than 10−10 is reached, otherwise when the number of function evaluations reached 105 · d. Only on the Rosenbrock functions about 15% of the runs were not successful due to the local minima and these are disregarded in the presentation. The code is implemented in Octave (single thread) and run on Debian 6.0 machine with Intel(R) Core(TM) i7-3770 3.4GHz CPU and 16 GB RAM. Figure 1 shows a typical result on the 50 dimensional fellcig function, for which the inverse Hessian is proportional −1 −1 to Dell (I + (106 − 1)uuT )Dell . First, D becomes propor-

400 10

10

1

v

mini f(xi)

10

300

0

10

5

10

σ

−5

10

103/sqrt(d)≈ 141

200

α

0

10

−1

10

−2

100

10

0

10

−3

−10

10

0

1

2

3

No. Func. Evals.

D

0

1

2

3

No. Func.Evals.

4

x 10

0

1

2

3

No. Func. Evals.

4

x 10

4

x 10

√ √ Figure 1: Single run on 50 dimensional fellcig with cigar axis u = (1/ d, . . . , 1/ d). The minimum f -value at each iteration, α, σ (left), v values (center), and diagonal elements of D (right) are plotted.

(b) Ratio of FEs

FEs / Dim

(a) 104

103

100

10-1

102

Ratio of CPU

101

CPU / Dim

sph tab cig cigrot ell ellcig ros rosrot

100 10-1 10-2

100

10-1

10-2 10

1

10

2

10

3

10

4

Dimension

101

102

Dimension

Figure 2: (a) Number of function evaluations (FEs) and CPU time [s] divided by d. (b) FEs and CPU time spent by VD-CMA divided by those spent by the CMA. Shown are average and standard deviation (error bar), from 10 independent runs. −1 tional to Dell , then v starts to adapt, ending up with vi ≈ p (106 − 1)/d, such that C = D(I + vv T )D finally becomes closely proportional to the inverse Hessian. The reason of the later adaptation of v is that the function value is less sensitive to the direction of u than the coordinate-wise scaling −1 by Dell and the selection bias is only visible after D ≈ Dell . After learning the inverse Hessian, the speed of convergence is as fast as on fsph . On the 50 dimensional fellcig , α is almost always one, whereas we observed a smaller α on lower dimensional functions (e.g., varying in [0.75, 1] on the 10 dimensional fellcig and in [0.65, 1] on the 10 dimensional fcig ). Figure 2.(a) shows the number of function evaluations and CPU time in seconds, averaged over 10 independent runs on eight functions. The number of function evaluations scale up linearly on fsph , ftab , fcig , fcigrot , and slightly more than linear on fell , fellcig . On the Rosenbrock functions fros and frosrot the scale up is close to quadratic. On the other hand, it took 1.6e7, 1.0e7, 1.2e7 function evaluations on ftabrot for d = 10, 20, 50, and 7.1e6, 1.3e7, 2.0e7 function evaluations

on fellrot for d = 10, 20, 50, respectively, whereas the CMA requires 6.0e3, 1.6e4, 6.6e4 on ftabrot and 6.2e3, 1.9e4, 1.0e5 function evaluations on fellrot . The inverse Hessian of these functions can not be well approximated by (5). Since the time complexity for each function evaluation is O(d), the total CPU time scales d times more than the number of function evaluations does. Figure 2.(b) shows the speed up over the standard CMA. The reason of the speed up in terms of the number of function evaluations is the (d−5)/6 times larger learning rate for the C update. The effect is more pronounced on fell , fellcig , ftab than on fcig and fcigrot , although they all have a Hessian matrix whose inverse can be represented by (5). This is because the standard CMA excels at learning single long components of C because of the cumulation in pC . On fsph , the performance does not differ much from standard CMA since C does not need to learn the shape and the σ update is dominative in determining the speed of convergence. When

Table 2: Average number of function evaluations to reach the target function value 10−9 on 20 dimensional functions among three runs except for MVA, where the average is computed over 70 runs. Data is taken from the references. Standard deviations are smaller than 3% of the average numbers except for MVA. On fros , no success was observed for MVA in 3.5 × 105 function evaluations. fell fros

VD-CMA 9.4 × 103 2.0 × 104

CMA 2.0 × 104 2.1 × 104

(1, 10)-AII 1.2 × 104 2.1 × 104

(1, 10)-MVA no success 5.7 × 104

it comes to the total CPU time, our approach improves over the standard CMA for d ≥ 50. Figure 3 shows the scale up of VD-CMA, sep-CMA [18] and R1-NES [19], which are all linear time and space algorithms based on the same natural gradient principle.2 Table 2 shows a comparison with the standard CMA, (1, 10)AII [11] and (1, 10)-MVA [17]. The reason of better scale up of VD-CMA and sep-CMA than R1-NES even on fsph is the cumulation employed to adapt σ. On fcig and fcigrot VDCMA is more efficient than R1-NES, though both Gaussian models maintained by VD-CMA and R1-NES can adapt to the inverse Hessian. This is the effect of the cumulation for covariance adaptation. On fell , sep-CMA is faster than VD-CMA simply because the learning rate for the C update is higher. On the other hand, MVA and R1-NES, both of which restrict the covariance matrix to maintain only one long direction, do not solve fell , fellcig and ftab . Since the model in our approach is richer than those maintained in sep-CMA and R1-NES, VD-CMA can solve more efficiently a larger class of functions including fellcig , fros and frosrot that are ill-conditioned and non-separable.

5.

DISCUSSION

Based on the IGO framework, we have derived a comparisonbased stochastic search algorithm, VD-CMA, for continuous optimization in high dimension. To achieve internal computational time and space complexity linear in dimension, we have restricted the covariance matrix of the Gaussian distribution to D(I + vv T )D. Since this model has singular points where the natural gradient is not defined and leads to unstable parameter update, we have defined a modified FIM to avoid the singularity and enable numerically stable computation of the natural gradient without significantly compromising the performance. Additionally to the natural gradient, we have incorporated the cumulation concept from the CMA-ES to make robust and accelerate the method. We have shown the advantage of VD-CMA over the standard CMA in two aspects. One is the linear scaling of the internal time and memory usage w.r.t. the number of variables that is desired for optimizing functions in high dimension. The other is better scaling of the number of function evaluations thanks to the higher learning rates for the covariance update. The second aspect implies VD-CMA outperforms the standard CMA even on a low or moderate dimensional 2

The sep-CMA was introduced independently of the natural gradient, but later it was found in [2] to derive from the natural gradient framework.

function where the inverse of the Hessian can be approximated by (5). On the other hand, if the model does not suit the inverse Hessian of a function, e.g., fellrot and ftabrot , VDCMA is inefficient compared to the standard-CMA. Compared to other linear time variants of the CMA-ES, it can solve a wider class of functions since we have a richer but linear number of elements of the covariance matrix. We end with a remark on parallelization. As well as most evolutionary algorithms, one can benefit from parallelization. For the sake of simplicity we assume that λ processors are available. Then sampling and evaluation for each solution are performed in parallel. Moreover, step 2 and step 3 in Section 3.3 are parallelizable by computing ˜ m ln pθ (xi:λ ), wi ∇ ˜ v ln pθ (xi:λ ) wi zi:λ , wi (xi:λ − m) = wi ∇ ˜ D ln pθ (xi:λ ) for each xi:λ in parallel. Then, the and wi ∇ number of floating point multiplications at each iteration on each processor reduces from O(λd) to O(d). The other computation required is the sorting of O(λ) floating point numbers for ranking and the sum of O(µ) floating point numbers for update, both of which are relatively cheap and can be also parallelized if needed. To further reduce the runtime, one can set the population size, λ, larger than the default value, which typically requires more function evaluations but smaller number of iterations. Large population also helps when the objective function is rugged.

Acknowledgements. This work was supported by the ANR-2010-COSI-002 (SIMINOLE) and ANR-2012-MONU-0009 (NumBBO) grants of the French National Research Agency.

6. REFERENCES [1] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Bidirectional Relation between CMA Evolution Strategies and Natural Evolution Strategies. In Parallel Problem Solving from Nature – PPSN XI, pages 154–163, 2010. [2] Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Theoretical Foundation for CMA-ES from Information Geometry Perspective. Algorithmica, 64:698–716, 2012. [3] S. Amari. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276, 1998. [4] S. Amari, H. Park, and K. Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12:1399–1409, 2000. [5] Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. arXiv:1106.3708, 2011. [6] P.-T. D. Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein. A Tutorial on the Cross-Entropy Method. Annals of Operations Research, (134):19–67, 2005. [7] T. Glasmachers, T. Schaul, Y. Sun, D. Wierstra, and J. Schmidhuber. Exponential Natural Evolution Strategies. In Proceedings of Genetic and Evolutionary Computation Conference, pages 393–400, 2010. [8] N. Hansen and A. Auger. Principled Design of Continuous Stochastic Search: From Theory to Practice. In Y. Borenstein and A. Moraglio, editors, Theory and Principled Methods for the Design of Metaheuristics. Springer, 2013.

Tablet

Sphere 107

Cigar

Rotated Cigar

VD-CMA Sep-CMA R1-NES

106

105

Function Evaluations

104

103

101

102

103 101

102

103 101

103 101

Rosenbrock

Ellipsoid Cigar

Ellipsoid

102

102

103

Rotated Rosenbrock

107

106

105

104

103

101

102

103 101

102

103 101

102

103 101

102

103

Dimension

Figure 3: Number of function evaluations spent by VD-CMA, Sep-CMA and R1-NES. Shown are average and standard deviation (error bar), from 10 independe nt runs. Missing data implies it failed to reach the target within 2 · 107 function evaluations. Note that the error bars are hardly visible since the std. are small. [9] N. Hansen, S. D. Muller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003. [10] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. [11] N. Hansen, A. Ostermeier, and A. Gawelczyk. On the Adaptation of Arbitrary Normal Mutation Distributions in Evolution Strategies: The Generating set Adaptation. In Proceedings of the Sixth International Conference on Genetic Algorithms, pages 57–64, 1995. [12] D. A. Harville. Matrix Algebra from a Statistician’s Perspective. Springer-Verlag, 2008. [13] A. Honkela, M. Tornio, T. Raiko, and J. Karhunen. Natural conjugate gradient in variational inference. In Neural Information Processing, 14th International Conference, ICONIP 2007, pages 305–314, 2008. [14] L. Malag` o, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Foundations of Genetic Algorithms, pages 230–242. 2011. [15] A. Miyamae, Y. Nagata, I. Ono, and S. Kobayashi. Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks. In Advances in Neural Information Processing Systems 23, pages 1660–1668, 2010.

[16] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008. [17] J. Poland and A. Zell. Main vector adaptation: A CMA variant with linear time and space complexity. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1050–1055, 2001. [18] R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity. Parallel Problem Solving from Nature–PPSN X, pages 296–305, 2008. [19] Y. Sun, F. Gomez, T. Schaul, and J. Schmidhuber. A Linear Time Natural Evolution Strategy for Non-Separable Functions. GECCO’13 Companion, pages 61–62, 2013. [20] T. Suttorp, N. Hansen, and C. Igel. Efficient covariance matrix update for variable metric evolution strategies. Machine Learning, 75(2):167–197, 2009. [21] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In IEEE Congress on Evolutionary Computation, pages 3381–3387, 2008.

APPENDIX A.

Next, each element of ∇θD ln pθ (x) is

PROOFS

The following two well-known formulas are used repeatedly in the sequel. Proposition A.1 (Sherman-Morrison). Suppose that A is invertible and v is a column vector. Then, A + vv T is invertible if and only if 1 + v T A−1 v 6= 0 and the inverse is −1 T −1 A given by A−1 − A1+vvv T A−1 v . Proposition A.2 (Partitioned matrix inversion). Suppose that T is nonsingular. Then, the partitioned matrix [T, U ; V, W ] is nonsingular if and only if the Schur complement ST = W − V T −1 U of T is nonsingular and the inverse is given by −1 −1 T U T 0 −T −1 U = + ST−1 −T −1 V I . V W 0 0 I

Let ei ∈ Rd be the unit vector whose ith element is one and the others are zero. Let δi,j = eT i ej which is one if i = j, 0 otherwise. In the following proofs we often use the following relations −1 ei eT C i D

−1

∂C ∂di

=

∂C ∂vi

= D(ei v T + veT i )D ,

+ CD

ei eT i

Proof of Lemma 3.1. The gradient w.r.t. m is proved in the main text. We are going to prove the gradients w.r.t. v and θD . Let y = σ −1 D−1 (x − m). We start from the equation given in the main text: = +

∂mT −1 Σ (x ∂θi 1 2

Tr

− m)

∂Σ −1 Σ−1 ∂θ Σ i

(x − m)(x − m) − Σ . T

Each element of ∇v ln pθ (x) is given as ∂ ln pθ (x) ∂vi

=

1 2

∂C Tr C −1 ∂d C −1 [Dyy T D − C] i

−1 −1 Tr C −1 (ei eT C + CD−1 ei eT [Dyy T D − C] i D i )C −1 = Tr C −1 ei eT CC −1 [Dyy T D − C] i D =

1 2

−1 = eT [Dyy T D − C]C −1 ei i D

T T −1 −1 = eT D − D−1 ]ei i [yy (I + vv )

T = eT i [yy (I −

1 vv T )D−1 1+kvk2

T −1 = eT − i [yy D

− D−1 ]ei

hy,vi yv T D−1 1+kvk2

− D−1 ]ei

hy,vi −1 = yi yi d−1 − 1+kvk − d−1 2 yi vi di i i hy,vi T −1 = ei D y ⊙ y − 1+kvk2 y ⊙ v − 1 .

Therefore, we have ∇θD ln pθ (x) = D−1 y ⊙ y − (1 + kvk2 )−1 hy, vi y ⊙ v − 1 . This ends the proof.

where di is the ith diagonal element of D and vi is the ith element of v.

∂ ln pθ (x) ∂θi

∂ ln pθ (x) ∂[θD ]i

Proof of Lemma 3.2. We start from the well-known formula stated in the proof idea of Lemma 3.2 in the main text: [I]i,j =

∂m −1 ∂mT Σ ∂θj ∂θi

+

1 2

∂Σ −1 ∂Σ Tr Σ−1 ∂θ Σ ∂θj i

.

Since Im is derived in the main text, we are going to derive an explicit form of IC . Each element of Iv,v is [Iv,v ]i,j

Tr − m)(x − m) − C] −1 T ∂C Tr C −1 ∂v C [Dyy D − C] i ∂C C −1 ∂v C −1 [(x i

=

1 2

=

1 2

= =

1 Tr C −1 D(ei v T 2 −1 eT [Dyy T D i DC

=

−1 eT Dyy T DC −1 Dv i DC

T

−1 + veT [Dyy T D − C] i )DC

− C]C −1 Dv

−1 − eT Dv i DC

T −1 T −1 = eT yy T (I + vv T )−1 v − eT v i (I + vv ) i (I + vv ) T T 1 1 = eT I − 1+kvk yy T I − 1+kvk v 2 vv 2 vv i T 1 v − eT I − 1+kvk 2 vv i 2 hy,vi 1 hy, vi − hy,vikvk − eT = eT y − 1+kvk 2v i 1+kvk2 v i 1+kvk2 i h hy,vi2 +1+kvk2 1 = eT v . i 1+kvk2 hy, vi y − 1+kvk2

Rewriting it in a vector form, we have ∇v ln pθ (x) = (1 + kvk2 )−1 hy, vi y

− (1 + kvk2 )−1 (hy, vi2 + 1 + kvk2 )v .

=

1 2

∂C ∂C Tr C −1 ∂v C −1 ∂v i j

σ4 2

−1 Tr C −1 (ei v T + veT (ej v T + veT i )C j ) T −1 T T −1 T = Tr (I + vv ) ei v (I + vv ) ej v + Tr (I + vv T )−1 ei v T (I + vv T )−1 veT j

=

T −1 = eT vv T (I + vv T )−1 ej i (I + vv )

T −1 + (v T (I + vv T )−1 v)eT ej i (I + vv ) T T −1 = eT v)(I + vv T )−1 i [(v (I + vv )

+ (I + vv T )−1 vv T (I + vv T )−1 ]ej h i kvk2 T −1 T 1 = eT (I + vv ) + vv ej i 2 2 2 1+kvk (1+kvk ) i h 2 kvk T T 1 1 vv vv + ej I − = eT 2 2 2 2 i 1+kvk 1+kvk (1+kvk ) h i 2 2 kvk kvk T T 1 = eT I − (1+kvk + (1+kvk ej 2 )2 vv 2 )2 vv i 1+kvk2 2 2 kvk − 1 kvk I− vv T ej = eT i 1 + kvk2 (1 + kvk2 )2 i h 1−kvk2 T 2 1 = eT ej . i 1+kvk2 kvk I + 1+kvk2 vv

Morisson formula, we have the inverse of Iv,v " # 1−kvk2

Each element of ID,D is [ID,D ]i,j =

1 2

=

1 2

∂C ∂C Tr C −1 ∂d C −1 ∂d i j Tr C

−1

−1 (ei eT C i D

+ CD

−1

−1 −1 = Tr C −1 ei eT CC −1 (ej eT C + CD−1 ej eT i D j D j ) −1 −1 −1 = eT CC −1 (ej eT C + CD−1 ej eT ei i D j D j )C

−1 −1 −1 −1 = eT ej eT CC −1 ei + eT CD−1 ej eT ei i D j D i D j C T T −1 −2 1 (I + vv T )D−1 ei = eT ej + eT i D i (I − 1+kvk2 vv )ej ej D vi vj −1 −2 = eT ej + δi,j − 1+kvk d−1 i D 2 i dj (δi,j + vi vj ) vi2 vj2 kvk2 δi,j vi vj −1 −2 2 d−1 − 1+kvk = eT ej + δi,j + 1+kvk i D 2 2 i dj vi2 vj2 kvk2 δi,j vi vj −1 2 d−1 − = 2δi,j + 1+kvk 2 i dj 1+kvk2 −1 1 = eT 2(1 + kvk2 )I + kvk2 V 2 − V vv T V D−1 ej . i 1+kvk2 D

Each element of ID,v is

−1 Tr C −1 (ei eT C + CD−1 ei eT i D i ) −1 T T · C D(ej v + vej )D

−1 = Tr C −1 ei eT CC −1 D(ej v T + veT i D j )D −1 −1 = eT CC −1 D(ej v T + veT ei i D j )DC

=

1+kvk2 kvk2

h I−

provided v 6= 0. First, observe −1 Iv,v Iv,D

= = =

=

=

1+kvk2 kvk2

h I−

(1−kvk2 ) vv T 2

i

vv T 1+kvk2 1−kvk2 1+ 1+kvk2

(1−kvk2 ) vv T 2

2+kvk2 I 1+kvk2

i

kvk2 vv T 1+kvk2

V D−1 h i 2 ) 1 I − (1−kvk vv T (2 + kvk2 )I − kvk2 vv T V D−1 2 kvk2 h 2 2 ) 1 (2 + kvk2 )I − kvk2 vv T − (1−kvk )(2+kvk vv T 2 kvk2 i 2 2 + (1−kvk2 )kvk vv T V D−1 h 1 (2 + kvk2 )I kvk2 i 2 2 4 −kvk2 +kvk4 vv T V D−1 − 2kvk +2−kvk −kvk 2 1 (2 + kvk2 )I − vv T V D−1 kvk2

−1 ID,v Iv,v Iv,D

1 vv T )D−1 ei 1+kvk2

T T T = d−1 i ei (ej v + vej )ei −

vi T d−1 eT i (ej v 1+kvk2 i

d−1 v

Altogether, with γv = 1 + kvk2 , we have Iv,v = γv−1 kvk2 I + (1 − kvk2 )γv−1 vv T , ID,v = γv−1 D−1 V (2 + kvk2 )I − vv T , ID,D = γv−1 D−1 2γv I + kvk2 V 2 − V vv T V D−1

T and Iv,D = ID,v is clear by definition. This completes the proof.

2+kvk2 I 1+kvk2

1 D−1 V kvk2

T 1 − 1+kvk 2 vv · (2 + kvk2 )I − vv T V D−1

=

1 D−1 V kvk2 (1+kvk2 )

−

=

1 D−1 V kvk2 (1+kvk2 )

=

1 D−1 V kvk2 (1+kvk2 )

=

(2 + kvk2 )I − kvk2 vv T · (2 + kvk2 )I − vv T V D−1

(2 + kvk2 )2 I − kvk2 (2 + kvk2 )vvT − (2 + kvk2 )vv T + kvk2 vv T V D−1

+ veT j )v

i 2 i = 2d−1 i vi δi,j − 1+kvk2 (kvk δi,j + vi vj ) 2 1 = d−1 i vi 2δi,j − 1+kvk2 (kvk δi,j + vi vj ) kvk2 1 = d−1 i vi 2δi,j − 1+kvk2 δi,j − 1+kvk2 vi vj 2+kvk2 1 = d−1 i vi 1+kvk2 δi,j − 1+kvk2 vi vj −1 1 = eT V (2 + kvk2 )I − vv T ej i 1+kvk2 D

=

T T −1 = eT ei i (ej v + vej )DC T T = eT i (ej v + vej )(I −

I−

Then,

[ID,v ]i,j 1 2

1+kvk2 kvk2

ei eT i )

−1 · C −1 (ej eT C + CD−1 ej eT j D j )

=

−1 Iv,v =

(2 + kvk2 )2 I

− (2 + kvk2 )2 I − (2 + 2kvk2 + kvk4 )vv T V D−1

(2 + kvk2 )2 I − (2 + 2kvk2 + kvk4 )vv T V D−1

1 D−1 V (1+kvk2 )

Finally we have Sv,v

h = D−1 2I +

2 kvk4 V 1+kvk2

−

kvk4 V 1+kvk2

vv T V

i 2 2 2 2 ) +kvk4 − α2 (2+kvk V + α2 2+2kvk V vv T V D−1 1+kvk2 1+kvk2 h 4 2 2 (2+kvk2 )2 V = D−1 2I + kvk −α 1+kvk2 i 4 2 2 +kvk4 ) − kvk −α (2+2kvk V vv T V D−1 1+kvk2 i h T = 2D−1 A + bvv D−1 .

This ends the proof.

Proof of Proposition 3.5. In the main text, a suffiProof of Lemma 3.3 and 3.4. Lemma 3.3 is the case cient condition is derived: that α = 1 in Lemma 3.4. Hence, it suffices to prove Lemma 3.4. α2 ≤ [kvk4 + (2 − γ)(1 + kvk2 )/ max(v j )]/(2 + kvk2 )2 −1 The Schur complement of Iv,v is Sv,v = ID,D −α2 ID,v Iv,v Iv,D α2 ≥ [kvk4 − (1 − γ)γ(1 + kvk2 )]/(kvk4 + 2kvk2 + 2) . as it is stated in the main text. By applying the Sherman-

It is obvious that the first inequality is satisfied by α in (7). Therefore, we are going to show that the second one is satisfied by this α. First, notice that the RHS of the second inequality is smaller than one. Therefore, the inequality holds if α = 1. If α < 1, the inequality reduces to α2 ≥ [kvk4 − (1 − γ)γ(1 + kvk2 )]/(kvk4 + 2kvk2 + 2)

⇔[kvk4 + (2 − γ)(1 + kvk2 )/ max(v j )]/(2 + kvk2 )2

≥ [kvk4 − (1 − γ)γ(1 + kvk2 )]/(kvk4 + 2kvk2 + 2)

⇔(kvk4 + 2kvk2 + 2)kvk4

+ (kvk4 + 2kvk2 + 2)(2 − γ)(1 + kvk2 )/ max(v j ) ≥ (kvk4 + 4kvk2 + 4)kvk4

− (kvk4 + 4kvk2 + 4)(1 − γ)γ(1 + kvk2 ) 2

4

⇔ − 2(1 + kvk )kvk

+ (kvk4 + 2kvk2 + 2)(2 − γ)(1 + kvk2 )/ max(v j ) ≥ −(kvk4 + 4kvk2 + 4)(1 − γ)γ(1 + kvk2 )

⇔ − 2kvk4 + (kvk4 + 2kvk2 + 2)(2 − γ)/ max(vj ) + (kvk4 + 4kvk2 + 4)(1 − γ)γ ≥ 0

⇐ − 2kvk4 + (kvk4 + 2kvk2 + 2)(2 − γ) + (kvk4 + 4kvk2 + 4)(1 − γ)γ ≥ 0

⇔ − (kvk4 + 4kvk2 + 4)γ 2 + 2(1 + kvk2 )γ + 4(1 + kvk2 ) ≥ 0

⇔(kvk2 + 2)2 γ 2 − 2(1 + kvk2 )γ − 4(1 + kvk2 ) ≤ 0

By solving the right-most side inequality, we obtain the sufficient condition for γ p (1 + kvk ) − (1 + kvk2 )2 + 4(2 + kvk2 )2 (1 + kvk2 ) ≤γ (2 + kvk2 )2 p (1 + kvk2 ) + (1 + kvk2 )2 + 4(2 + kvk2 )2 (1 + kvk2 ) ≤ . (2 + kvk2 )2 2

The LHS is negative, so γ = (1 + kvk2 )−1/2 satisfies the left inequality. The RHS is (1 + kvk2 ) +

p

(1 + kvk2 )2 + 4(2 + kvk2 )2 (1 + kvk2 ) (2 + kvk2 )2 p p 2 2 2 1 + kvk2 4(kvk + 2) (1 + kvk2 ) > = (2 + kvk2 )2 2 + kvk2 1 ≥p =γ . 1 + kvk2 Hence, the right inequality is satisfied. Therefore, the second inequality is satisfied, which completes the proof. Proof of Theorem 3.6. In the main text we have shown ˜ m ln pθ (x) = x − m. We are going to show that that ∇ ˜ v ln pθ (x) = kvk−1 t and ∇ ˜ θ ln pθ (x) = Ds with s and ∇ D t computed in step 4 and step 5.

The modified natural gradient w.r.t. θC is computed as

∇v ln pθ (x) ∇θD ln pθ (x) −1 −1 −αIv,v Iv,D Iv,v ∇v ln pθ (x) + = I 0 ∇v ln pθ (x) −1 −1 · Sv,v −ID,v αIv,v I ∇θD ln pθ (x) −1 −1 −αIv,v Iv,D −1 Iv,v ∇v ln pθ (x) Sv,v + = I 0 −1 · ∇θD ln pθ (x) − αID,v Iv,v ∇v ln pθ (x) −1 −1 −αIv,v Iv,D D Iv,v ∇v ln pθ (x) −1 D−1 Sv,v + = I 0 −1 · D−1 D∇θD ln pθ (x) −αDID,v Iv,v ∇v ln pθ (x) | {z } | {z }

Iv,v αID,v

αIv,D ID,D

−1

=:s1

=

=:t2 /kvk

−1 t2 /kvk −αIv,v Iv,D D −1 −1 D−1 Sv,v D + 0 D · (s1 − (α/kvk)DID,v t2 ) {z } |

=:s3

−1 t /kvk −αIv,v Iv,D D −1 −1 D−1 Sv,v D s3 = 2 + 0 D | {z } =:s4

=

  −1 t2 /kvk − αIv,v Iv,D Ds4 {z } |   =:t5 /kvk

.

Ds4

According to Lemma 3.1, it is easy to see that s1 is computed by s in step 1. In the proof of Lemma 3.4, we have −1 derived the explicit forms of Iv,v . From this and Lemma 3.1, we have t2 = kvk

1+kvk2 kvk2

h I−

(1−kvk2 ) vv T 2

i

· γv−1 hy, vi y − γv−1 (hy, vi2 + γv )v h i 2 ) 1 = kvk I − (1−kvk vv T hy, vi y − γv−1 (hy, vi2 + γv )v 2 h 1 = kvk hy, vi y − γv−1 (hy, vi2 + γv )v i 2 2 ) ) (hy, vi2 + γv )v hy, vi2 v + (1−kvk − (1−kvk 2kvk 2γv h i 2 ) 1 = kvk hy, vi y − (1−kvk hy, vi2 v − 21 (hy, vi2 + γv )v 2kvk = hy, vi y − 2−1 (hy, vi2 + γv )v .

Hence t2 = t in step 2. According to Lemma 3.2, s3 = s1 − (α/kvk)γv−1 V (2 + kvk2 )I − vv T t2 = s1 − αγv−1 V (2 + kvk2 )I − kvk2 vv T t2 = s1 − αγv−1 (2 + kvk2 )v ⊙ t2 − kvk2 hv, t2 i v ,

which is equivalent to s in step 3. According to Proposition 3.5, T T s4 = A−1 − (1 + bv A−1 v)−1 bA−1 vv A−1 s3

T = A−1 s3 − (1 + bv A−1 v)−1 b A−1 v, s3 A−1 v

and s4 = s in step 4. In the proof of Lemma 3.4, we have −1 derived the explicit forms of Iv,v Iv,D . With this, we have T 2 1 V s4 t5 = t2 − αkvk kvk 2 (2 + kvk )I − vv = t2 − α (2 + kvk2 )I − vv T V s4

= t2 − α (2 + kvk2 )v ⊙ s4 − v, s4 v .

˜ v ln pθ (x) = This is equivalent to t in step 5. Therefore, ∇ −1 −1 ˜ kvk t5 = kvk t and ∇θD ln pθ (x) = Ds4 = Ds.

Functional Gradient Descent Optimization for ... - public.asu.edu