LECTURE 8: OPTIMIZATION IN HIGHER DIMENSIONS

• Optimization (maximization/minimization) is of huge importance in data analysis and is the basis for recent breakthroughs in machine learning and big data
• Much of it is application dependent, and a vast number of methods have been developed: we cannot cover them all in this lecture
• Broadly, methods can be divided into 1st order (derivatives are available, but not the Hessian) and 2nd order (approximate or full Hessian evaluation)
• 0th order (no gradients available): use finite differences to get the gradient, or use the downhill simplex (Nelder-Mead) method. These are very slow and we will not discuss them here.

Preparation of parameters
• Often the parameters are constrained: they may be positive (or negative), or bounded to an interval
• The first step is to make the optimization unconstrained: map the parameter to a new parameter that is unbounded. For example, if a variable is positive, $x>0$, use $z=\log(x)$ instead of $x$.
• One should also transform the prior so that it reflects the original prior: $p_{\rm pr}(z)\,dz = p_{\rm pr}(x)\,dx$
• If $x>0$ has a uniform prior in $x$, then $p_{\rm pr}(z) = dx/dz = x = e^z$
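A minimal sketch of this reparameterization in Python (the function names are illustrative, not from the lecture):

```python
import numpy as np

def to_unconstrained(x):
    """Map a positive parameter x > 0 to an unbounded parameter z = log(x)."""
    return np.log(x)

def to_constrained(z):
    """Inverse map back to the original positive parameter x = exp(z)."""
    return np.exp(z)

def log_prior_z(z):
    """Uniform prior in x transformed to z: p(z) = p(x) |dx/dz| = e^z,
    so log p(z) = z (up to an additive constant)."""
    return z
```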

General strategy
• We want to descend down a function J(a) (if minimizing) using an iterative sequence of steps $a_t$. For this we need to choose a direction $p_t$ and move in that direction: $J(a_t + \eta p_t)$
• A few options: fix $\eta$
• Line search: vary $\eta$ until $J(a_t + \eta p_t)$ is minimized
• Trust region: construct an approximate quadratic model for J and minimize it, but only within a trust region where the quadratic model is approximately valid

Line search directions and backtracking
• Gradient descent: negative gradient $-\nabla_a J(a, x_t)$
• Newton: inverse Hessian times gradient, $-H^{-1}\nabla_a J(a)$
• Quasi-Newton: approximate $H^{-1}$ with $B^{-1}$ (SR1 and BFGS)
• Nonlinear conjugate gradient: $p_t = -\nabla_a J(a, x_t) + \beta_t p_{t-1}$, where $p_{t-1}$ and $p_t$ are conjugate
• Step length with backtracking: choose a first proposed step length
• If it does not reduce the function value, reduce it by some factor and check again
• Repeat until the step length reaches $\epsilon$; at that point switch to gradient descent
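A minimal backtracking sketch (names illustrative; an Armijo-style sufficient-decrease test is included as a common safeguard, while the slide only requires that the function value decreases):

```python
import numpy as np

def backtracking(J, a, p, grad, eta0=1.0, shrink=0.5, c=1e-4, eta_min=1e-10):
    """Backtracking line search sketch: start from a proposed step length eta0
    and shrink it until J decreases sufficiently."""
    eta = eta0
    J0 = J(a)
    slope = np.dot(grad, p)                    # directional derivative along p (< 0 for descent)
    while J(a + eta * p) > J0 + c * eta * slope:
        eta *= shrink                          # step too long: reduce by a factor
        if eta < eta_min:                      # essentially zero step: give up
            break
    return eta
```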

Trust region method
• Multi-dimensional parabola method: define an approximate quadratic model of J, but limit the step
• Here $\Delta_k$ is the trust region radius
• Evaluate at the previous iteration and compare the actual reduction to the predicted reduction

• If the ratio $\rho_k$ of actual to predicted reduction is around 1, we can increase $\Delta_k$
• If it is close to 0 or negative, we shrink $\Delta_k$

• The direction changes if $\Delta_k$ changes
• If the trust region contains the point where the gradient of the quadratic model vanishes (the full quadratic step), step there
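An illustrative sketch of the radius update; the thresholds and factors below are typical textbook choices (e.g. Nocedal & Wright), not values fixed by the lecture:

```python
def update_trust_radius(rho, delta, step_norm, delta_max=10.0):
    """rho = actual reduction / predicted reduction of the last step."""
    if rho < 0.25:                            # quadratic model was poor: shrink the region
        delta *= 0.25
    elif rho > 0.75 and step_norm >= 0.99 * delta:
        delta = min(2.0 * delta, delta_max)   # model good and step hit the boundary: grow
    return delta
```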

Line search vs trust region

1st order: gradient descent
• We have a vector of parameters a and a scalar loss (cost) function J(a, x, y), a function of the data vectors (x, y), that we want to optimize (say minimize). This could be a nonlinear least squares loss function: $J = \chi^2$
• (Batch) gradient descent updates all the variables at once: $\delta a = -\eta \nabla_a J(a)$; in ML, $\eta$ is called the learning rate
• It gets stuck on saddle points, where the gradient vanishes in every direction (see animation later)
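A minimal batch gradient descent sketch, assuming a user-supplied gradient function grad_J (illustrative names):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, n_steps=1000, tol=1e-8):
    """grad_J(a) returns the gradient of the loss over the full data set."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_steps):
        g = grad_J(a)
        if np.linalg.norm(g) < tol:      # stop when the gradient (nearly) vanishes
            break
        a = a - eta * g                  # update all parameters at once
    return a
```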

Scaling
• Change variables to make the loss surface more circular
• Example: change of dimensions (units)

Stochastic gradient descent

• Stochastic gradient descent: do this just for one data pair $(x_i, y_i)$: $\delta a = -\eta \nabla_a J(a, x_i, y_i)$
• This saves on computational cost, but is noisy, so one repeats it by randomly choosing the data index i
• It has large fluctuations in the cost function

• This is potentially a good thing: it may avoid getting stuck in local minima (or at saddle points)
• The learning rate is slowly reduced as the iterations proceed
• It has revolutionized machine learning
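A minimal SGD sketch, assuming grad_Ji(a, xi, yi) returns the single-pair gradient (illustrative names):

```python
import numpy as np

def sgd(grad_Ji, a0, data, eta0=0.1, n_epochs=10, decay=0.99):
    """data is a list of (xi, yi) pairs; the learning rate is slowly
    reduced after every pass over the data."""
    a = np.asarray(a0, dtype=float)
    eta = eta0
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        for i in rng.permutation(len(data)):   # visit data pairs in random order
            xi, yi = data[i]
            a = a - eta * grad_Ji(a, xi, yi)
        eta *= decay                           # slowly reduce the learning rate
    return a
```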

Mini-batch stochastic descent

• Mini-batch takes advantage of hardware and software implementations where the gradient with respect to a number of data points can be evaluated as fast as for a single data point (e.g. a mini-batch of N=256)
• Challenges of (stochastic) gradient descent: how to choose the learning rate (in 2nd order methods it is given by the Hessian)
• Ravines:

Ravines

Adding momentum: rolling down the hill
• We can add momentum and mimic a ball rolling down the hill
• Use the previous update as the direction
• $v_t = \gamma v_{t-1} + \eta \nabla_a J(a)$, $\delta a = -v_t$, with $\gamma$ of order 1 (e.g. 0.9)
• Momentum builds up in directions where the gradient does not change sign
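A one-step momentum sketch of the update above (illustrative names):

```python
import numpy as np

def momentum_step(a, v, grad, eta=0.01, gamma=0.9):
    """One momentum update: v_t = gamma*v_{t-1} + eta*grad, then da = -v_t."""
    v = gamma * v + eta * grad
    return a - v, v

# usage sketch: start with v = np.zeros_like(a), then repeat
#   a, v = momentum_step(a, v, grad_J(a))
```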

Nesterov accelerated gradient
• We can predict where to evaluate the next gradient using the previous velocity update
• $v_t = \gamma v_{t-1} + \eta \nabla_a J(a - \gamma v_{t-1})$, $\delta a = -v_t$
• Momentum (blue) vs NAG (brown + red = green)

• See https://arxiv.org/abs/1603.04245 for theoretical justification of NAG based on a Bregman divergence Lagrangian

Adagrad, Adadelta, RMSprop, ADAM…
• Make the learning rate $\eta$ dependent on the parameter $a_i$
• Use past gradient information to update $\eta$
• Example ADAM: ADAptive Moment estimation
• $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, with $g_t = \nabla_a J(a)$
• $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$
• Bias correction: $m_t' = m_t/(1-\beta_1^t)$, $v_t' = v_t/(1-\beta_2^t)$
• Update rule: $\delta a = -\eta\, m_t'/(\sqrt{v_t'} + \epsilon)$
• Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
• The methods are empirical (show animation)
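A sketch of one ADAM update following the equations above (illustrative names):

```python
import numpy as np

def adam_step(a, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based iteration count, needed for the bias correction."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1**t)                # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    a = a - eta * m_hat / (np.sqrt(v_hat) + eps)
    return a, m, v
```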

2nd order methods: Newton

• We have seen that there is no natural way to choose the learning rate in 1st order methods
• But Newton's method provides a clear answer for what the learning rate should be:
• $J(a+\delta a) = J(a) + \delta a \cdot \nabla_a J(a) + \tfrac{1}{2}\,\delta a^T\, \nabla_a \nabla_{a'} J(a)\,\delta a + \dots$
• Hessian: $H_{ij} = \nabla_{a_i} \nabla_{a_j} J(a)$
• At the extremum we want $\nabla_a J(a+\delta a) = 0$, so a Newton update step is $\delta a = -H^{-1} \nabla_a J(a)$
• We do not need to guess the learning rate
• We do need to evaluate the Hessian and invert it (or use LU decomposition): expensive in many dimensions!
• In many dimensions we use iterative schemes to solve this problem
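A sketch of the Newton step, assuming user-supplied grad_J and hess_J functions (illustrative; it solves the linear system rather than inverting H):

```python
import numpy as np

def newton_step(grad, hess):
    """Newton update step: solve H * da = -grad instead of inverting H explicitly
    (np.linalg.solve uses an LU factorization internally)."""
    return np.linalg.solve(hess, -grad)

# usage sketch: a = a + newton_step(grad_J(a), hess_J(a))
```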

Quasi-Newton
• Computing the Hessian and inverting it is expensive, but one can approximate it with low-rank updates
• Symmetric rank 1 (SR1)
• BFGS (rank 2 update, positive definite)
• Inverse via the Woodbury formula

L-BFGS
• For large problems even this gets too expensive. Limited-memory BFGS builds the update only from the last N iterations (N of order 10-100)
• In practice increasing N often does not improve the results
• Historical note: quasi-Newton methods originate from the work of W. C. Davidon, a physicist at Argonne National Laboratory, in the 1950s.
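As a usage illustration (not from the lecture), SciPy's limited-memory BFGS on the Rosenbrock test function; maxcor is the number of stored past updates, i.e. the N above:

```python
import numpy as np
from scipy.optimize import minimize

def J(a):   # Rosenbrock test function (not from the lecture)
    return (1 - a[0])**2 + 100 * (a[1] - a[0]**2)**2

def grad_J(a):
    return np.array([-2 * (1 - a[0]) - 400 * a[0] * (a[1] - a[0]**2),
                     200 * (a[1] - a[0]**2)])

res = minimize(J, x0=np.zeros(2), jac=grad_J,
               method='L-BFGS-B', options={'maxcor': 10})
print(res.x)   # close to [1, 1]
```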

Linear conjugate direction
• This is an iterative method to solve $Ax = b$ (so it belongs to linear algebra)
• It can be used for optimization: minimize $J = \tfrac{1}{2}x^T A x - b^T x$
• Conjugate vectors: $p_i^T A p_j = 0$ for all $i \ne j$
• The construction is similar to Gram-Schmidt (QR), with A playing the role of the scalar product (norm): $x_{k+1} = x_k + \alpha_k p_k$, where $\alpha_k = -r_k^T p_k/(p_k^T A p_k)$ and $r_k = A x_k - b$
• Essentially we take the dot product (in the A norm) of the vector with the previous vectors to project it perpendicular to the previous vectors
• Since the space is N-dimensional, after N steps we have spanned the full space and converged to the true solution, $r_N = 0$.

Conjugate direction
• If we have the matrix A in diagonal form, so that the basis vectors are orthogonal, we can find the minimum trivially by minimizing along each axis in turn; otherwise we cannot

Linear conjugate gradient

• Compute $p_k$ from $p_{k-1}$
• We want the step to be a linear combination of the residual $-r_k$ and the previous direction $p_{k-1}$
• $p_k = -r_k + \beta_k p_{k-1}$; premultiply by $p_{k-1}^T A$
• $\beta_k = (r_k^T A p_{k-1})/(p_{k-1}^T A p_{k-1})$, imposing $p_{k-1}^T A p_k = 0$
• Converges rapidly when the eigenvalues are similar, not so rapidly when the condition number is high
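A minimal linear conjugate gradient sketch for symmetric positive definite A (illustrative, not optimized):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Solve A x = b iteratively with the conjugate gradient method."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = A @ x - b                        # residual r_k = A x_k - b
    p = -r                               # first direction: steepest descent
    for _ in range(max_iter or n):
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)       # equals -r^T p / (p^T A p) here
        x = x + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p            # new direction, conjugate to the previous ones
        r = r_new
    return x
```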

Preconditioning
• Tries to improve the condition number of A by multiplying by another matrix C that is simple

• We wish to reduce the condition number of the preconditioned matrix
• Example: incomplete Cholesky, $A \approx LL^T$, computing only a sparse L
• Preconditioners are very problem specific

Nonlinear conjugate gradient
• Replace $\alpha_k$ with a line search that minimizes J, and use $x_{k+1} = x_k + \alpha_k p_k$
• Replace $r_k = A x_k - b$ with the gradient of J: $\nabla_a J$
• This is the Fletcher-Reeves version; Polak-Ribiere modifies $\beta$
• CG is one of the most competitive methods, but requires the Hessian to have a low condition number
• Typically we do a few CG steps at each k, then move on to a new gradient evaluation
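As a usage illustration (not from the lecture), SciPy's nonlinear conjugate gradient on the same Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize

def J(a):   # Rosenbrock test function (not from the lecture)
    return (1 - a[0])**2 + 100 * (a[1] - a[0]**2)**2

def grad_J(a):
    return np.array([-2 * (1 - a[0]) - 400 * a[0] * (a[1] - a[0]**2),
                     200 * (a[1] - a[0]**2)])

# method='CG' is SciPy's nonlinear conjugate gradient (a Polak-Ribiere variant)
res = minimize(J, x0=np.array([-1.0, 1.0]), jac=grad_J, method='CG')
print(res.x, res.nit)
```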

CG vs gradient descent: in 2D, CG has to converge in 2 steps

Gauss-Newton for nonlinear least squares

Line search in direction $\delta a$

We drop the 2nd term in the Hessian because the residual $r_i = y_i - y$ is small and fluctuates around 0, and because $y''$ may be small (or zero for linear problems)

Gauss-Newton + trust region = Levenberg-Marquardt method
• Solving $A^T A\,\delta a = A^T b$ is equivalent to minimizing $|A\,\delta a - b|^2$
• If the solution lies within the trust region, just solve this equation
• If not, we need to impose $\lVert \delta a \rVert = \Delta_k$
• Minimization with a Lagrange multiplier is equivalent to $(A^T A + \lambda I)\,\delta a = A^T b$ with $\lambda(\Delta_k - \lVert \delta a \rVert) = 0$
• For small $\lambda$ this is Gauss-Newton (use close to the minimum); for large $\lambda$ this is steepest descent (use far from the minimum)
• A good method for nonlinear least squares
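A minimal Levenberg-Marquardt sketch of the damped normal equations above (with b = -r in the slide's notation; the simple lambda doubling/shrinking heuristic stands in for a full trust-region update and is illustrative, not the lecture's prescription):

```python
import numpy as np

def levenberg_marquardt(residual, jac, a0, lam=1e-3, n_steps=100, tol=1e-10):
    """residual(a) returns the residual vector r(a); jac(a) its Jacobian A = dr/da."""
    a = np.asarray(a0, dtype=float)
    for _ in range(n_steps):
        r = residual(a)
        A = jac(a)
        g = A.T @ r
        if np.linalg.norm(g) < tol:
            break
        # damped normal equations: (A^T A + lambda I) da = -A^T r
        da = np.linalg.solve(A.T @ A + lam * np.eye(len(a)), -g)
        if np.sum(residual(a + da)**2) < np.sum(r**2):
            a, lam = a + da, lam * 0.3     # success: accept step, reduce damping (-> Gauss-Newton)
        else:
            lam *= 2.0                     # failure: increase damping (-> steepest descent)
    return a
```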

Literature
• Numerical Recipes, Ch. 9, 10, 15
• Newman, Ch. 6
• Nocedal and Wright, Numerical Optimization
• https://arxiv.org/abs/1609.04747
