Here's a popular story about momentum [1, 2, 3]: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.
This standard story isn't wrong, but it fails to explain many important behaviors of momentum. In fact, momentum can be understood far more precisely if we study it on the right model.
One nice model is the convex quadratic. This model is rich enough to reproduce momentum's local dynamics in real problems, and yet simple enough to be understood in closed form. This balance gives us powerful traction for understanding this algorithm.
We begin with gradient descent. The algorithm has many virtues, but speed is not one of them. It is simple: when optimizing a smooth function $f$, we take a small step in the direction of the negative gradient,
$$w^{k+1} = w^k - \alpha\,\nabla f(w^k).$$
For a step-size small enough, gradient descent makes a monotonic improvement at every iteration. It always converges, albeit to a local minimum. And under a few weak curvature conditions it can even get there at an exponential rate.
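To make the iteration concrete, here is a minimal sketch of plain gradient descent in NumPy (the quadratic objective, step-size, and iteration count below are my own illustrative choices, not taken from the text):

```python
import numpy as np

def gradient_descent(grad, w0, alpha, steps):
    """Plain gradient descent: repeatedly step against the gradient."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Example: minimize f(w) = 1/2 w^T A w - b^T w, whose gradient is A w - b.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
w = gradient_descent(lambda w: A @ w - b, w0=[0.0, 0.0], alpha=0.1, steps=500)
print(w, np.linalg.solve(A, b))   # the iterates approach the minimizer A^{-1} b
```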
But the exponential decrease, though appealing in theory, can often be infuriatingly slow. Things often begin quite well, with an impressive, almost immediate decrease in the loss. But as the iterations progress, things start to slow down. You start to get a nagging feeling you're not making as much progress as you should be. What has gone wrong?
The problem could be the optimizer's old nemesis, pathological curvature. Pathological curvature is, simply put, regions of $f$ which aren't scaled properly. The landscapes are often described as valleys, trenches, canals and ravines. The iterates either jump between valleys, or approach the optimum in small, timid steps. Progress along certain directions grinds to a halt. In these unfortunate regions, gradient descent fumbles.
Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:
$$z^{k+1} = \beta z^k + \nabla f(w^k), \qquad w^{k+1} = w^k - \alpha z^{k+1}.$$
The change is innocent, and costs almost nothing. When $\beta = 0$, we recover gradient descent. But for $\beta = 0.99$ (sometimes $0.999$, if things are really bad), this appears to be the boost we need. Our iterations regain the speed and boldness they had lost, speeding to the optimum with a renewed energy.
Optimizers call this minor miracle "acceleration".
The new algorithm may seem at first glance like a cheap hack. A simple trick to get around gradient descent's more aberrant behavior, a smoother for oscillations between steep canyons. But the truth, if anything, is the other way around. It is gradient descent which is the hack. First, momentum gives up to a quadratic speedup on many functions. 1 This is no small matter; this is similar to the speedup you get from the Fast Fourier Transform, Quicksort, and Grover's Algorithm. When the universe gives you quadratic speedups, you should start to pay attention.
But there's more. A lower bound, courtesy of Nesterov [5], states that momentum is, in a certain very narrow and technical sense, optimal. Now, this doesn't mean it is the best algorithm for all functions in all circumstances. But it does satisfy some curiously beautiful mathematical properties which scratch a very human itch for perfection and closure. But more on that later. Let's say this for now: momentum is an algorithm for the book.
First Steps: Gradient Descent
We begin by studying gradient descent on the simplest model possible which isn't trivial, the convex quadratic,
$$f(w) = \tfrac{1}{2}w^T A w - b^T w, \qquad w \in \mathbb{R}^n.$$
Assume $A$ is symmetric and invertible; then the optimal solution $w^\star$ occurs at
$$w^\star = A^{-1}b.$$
Simple as this model may be, it is rich enough to approximate many functions (think of $A$ as your favorite model of curvature: the Hessian, Fisher Information Matrix [6], etc.) and captures all the key features of pathological curvature. And more importantly, we can write an exact closed formula for gradient descent on this function.
Here is how it goes. Since $\nabla f(w) = Aw - b$, the iterates are
$$w^{k+1} = w^k - \alpha(Aw^k - b).$$
Here's the trick. There is a very natural space to view gradient descent where all the dimensions act independently: the eigenvectors of $A$. Every symmetric matrix $A$ has an eigenvalue decomposition
$$A = Q\,\mathrm{diag}(\lambda_1,\ldots,\lambda_n)\,Q^T, \qquad Q = [q_1,\ldots,q_n],$$
and, as per convention, we will assume the $\lambda_i$'s are sorted, from smallest $\lambda_1$ to biggest $\lambda_n$. If we perform a change of basis, $x^k = Q^T(w^k - w^\star)$, the iterations break apart, becoming
$$x_i^{k+1} = x_i^k - \alpha\lambda_i x_i^k = (1-\alpha\lambda_i)\,x_i^k \qquad\Rightarrow\qquad x_i^k = (1-\alpha\lambda_i)^k\,x_i^0.$$
Moving back to our original space $w$, we can see that
$$w^k - w^\star = Qx^k = \sum_{i} x_i^0(1-\alpha\lambda_i)^k q_i,$$
and there we have it: gradient descent in closed form.
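As a sanity check on this closed form, the following sketch (my own, on an arbitrary small quadratic) runs the iteration directly and compares it against the per-eigencomponent formula $x_i^k = (1-\alpha\lambda_i)^k x_i^0$:

```python
import numpy as np

# An arbitrary symmetric positive definite A and right-hand side b.
A = np.array([[2.0, 0.3],
              [0.3, 0.5]])
b = np.array([1.0, 2.0])
w_star = np.linalg.solve(A, b)

lam, Q = np.linalg.eigh(A)           # A = Q diag(lam) Q^T
alpha, steps = 0.1, 25
w = np.zeros(2)                      # initial guess w^0 = 0
x0 = Q.T @ (w - w_star)              # error of the initial guess in the eigenbasis

for _ in range(steps):
    w = w - alpha * (A @ w - b)      # gradient descent on the quadratic

x_closed = (1 - alpha * lam) ** steps * x0   # closed-form error after `steps` iterations
w_closed = w_star + Q @ x_closed             # map back to w-space
print(np.allclose(w, w_closed))              # True: the two agree
```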
Decomposing the Error
The above equation admits a simple interpretation. Each element of $x^0$ is the component of the error in the initial guess in the $Q$-basis. There are $n$ such errors, and each of these errors follows its own, solitary path to the minimum, decreasing exponentially with a compounding rate of $1-\alpha\lambda_i$. The closer that number is to $1$, the slower it converges.
For most step-sizes, the eigenvectors with the largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the struggles of the smaller eigenvalues are revealed. Writing out the contribution of each eigenspace's error to the loss,
$$f(w^k) - f(w^\star) = \tfrac{1}{2}\sum_i (1-\alpha\lambda_i)^{2k}\,\lambda_i\,[x_i^0]^2,$$
we see that the overall rate of convergence is determined by the slowest error component, which must belong to either $\lambda_1$ or $\lambda_n$:
$$\mathrm{rate}(\alpha) = \max_i |1-\alpha\lambda_i| = \max\{|1-\alpha\lambda_1|,\ |1-\alpha\lambda_n|\}.$$
This overall rate is minimized when the rates for $\lambda_1$ and $\lambda_n$ are the same; this mirrors our informal observation in the previous section that the optimal step-size causes the first and last eigenvectors to converge at the same rate. If we work this through we get
$$\text{optimal }\alpha = \operatorname*{argmin}_\alpha \mathrm{rate}(\alpha) = \frac{2}{\lambda_1+\lambda_n}, \qquad \text{optimal rate} = \min_\alpha \mathrm{rate}(\alpha) = \frac{\lambda_n/\lambda_1 - 1}{\lambda_n/\lambda_1 + 1}.$$
Notice that the ratio $\lambda_n/\lambda_1$ determines the convergence rate of the problem. In fact, this ratio appears often enough that we give it a name, and a symbol: the condition number.
$$\text{condition number} := \kappa := \frac{\lambda_n}{\lambda_1}$$
The condition number means many things. It is a measure of how close to singular a matrix is. It is a measure of how robust $A^{-1}b$ is to perturbations in $b$. And, in this context, the condition number gives us a measure of how poorly gradient descent will perform. A ratio of $\kappa = 1$ is ideal, giving convergence in one step (of course, the function is trivial). Unfortunately, the larger the ratio, the slower gradient descent will be. The condition number is therefore a direct measure of pathological curvature.
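A quick numerical illustration (my own sketch, with a made-up spectrum): balancing the first and last eigenvalues gives the step-size $2/(\lambda_1+\lambda_n)$ and a worst-case rate of $(\kappa-1)/(\kappa+1)$.

```python
import numpy as np

def gd_rate(alpha, lam):
    """Worst-case per-step contraction of gradient descent over a spectrum."""
    return np.max(np.abs(1 - alpha * np.asarray(lam)))

lam = np.array([1.0, 3.0, 10.0, 100.0])   # hypothetical eigenvalues lambda_1 ... lambda_n
kappa = lam[-1] / lam[0]                  # condition number
alpha_opt = 2 / (lam[0] + lam[-1])        # balances the first and last eigenvalues

print(gd_rate(alpha_opt, lam))            # 0.980198...
print((kappa - 1) / (kappa + 1))          # the same value, (kappa - 1) / (kappa + 1)
```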
Example: Polynomial Regression
The above analysis reveals an insight: not all errors are made equal. Indeed, there are different kinds of errors, $n$ to be exact, one for each of the eigenvectors of $A$. And gradient descent is better at correcting some kinds of errors than others. But what do the eigenvectors of $A$ mean? Surprisingly, in many applications they admit a very concrete interpretation.
Let's see how this plays out in polynomial regression. Given 1D data $\xi_i$ with observations $d_i$, we fit a polynomial model to the data; the weights are found by minimizing a least squares objective, a convex quadratic of exactly the form above, with $A$ the Gram matrix of the polynomial features. The path of convergence is easiest to see in the eigenbasis of $A$: rotating the weights into that basis counter-rotates the features into a new set of features $\bar{p}$.
This model is identical to the old one. But these new features $\bar{p}$ (which I will call "eigenfeatures") and weights have the pleasing property that each coordinate acts independently of the others. Now our optimization problem breaks down, really, into $n$ small 1D optimization problems. And each coordinate can be optimized greedily and independently, one at a time in any order, to produce the final, global, optimum. The eigenfeatures are also much more informative:
The observations in the above diagram can be justified mathematically. From a statistical point of view, we would like a model which is, in some sense, robust to noise. Our model cannot possibly be meaningful if the slightest perturbation to the observations changes the entire model dramatically. And the eigenfeatures, the principal components of the data, give us exactly the decomposition we need to sort the features by their sensitivity to perturbations in the $d_i$'s. The most robust components appear in the front (with the largest eigenvalues), and the most sensitive components in the back (with the smallest eigenvalues).
This measure of robustness, by a rather convenient coincidence, is also a measure of how easily an eigenspace converges. And thus the "pathological directions" (the eigenspaces which converge the slowest) are also the ones most sensitive to noise! So starting at a simple initial point like $0$ (by a gross abuse of language, let's think of this as a prior), we track the iterates until a desired level of complexity is reached. Let's see how this plays out in gradient descent.
This effect is harnessed with the heuristic of early stopping: by stopping the optimization early, you can often get better generalizing results. Indeed, the effect of early stopping is very similar to that of more conventional methods of regularization, such as Tikhonov regression. Both methods try to suppress the components of the smallest eigenvalues directly, though they employ different methods of spectral decay.2 But early stopping has a distinct advantage. Once the step-size is chosen, there are no regularization parameters to fiddle with. Indeed, in the course of a single optimization, we have the entire family of models, from underfitted to overfitted, at our disposal. This gift, it seems, doesn't come at a price. A beautiful free lunch [7] indeed.
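As a rough illustration of this family-of-models view (a toy setup of my own; the data, polynomial degree, and noise level are invented), the snippet below uses the closed form of gradient descent started at zero to sweep the entire optimization path of a least squares polynomial fit, tracking the error against the noiseless function. On noisy draws like this, the held-out error typically bottoms out well before full convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: noisy samples d_i of a smooth underlying function at points xi_i.
xi = np.linspace(-1, 1, 30)
d = np.sin(np.pi * xi) + 0.3 * rng.standard_normal(xi.size)
degree = 12
Z = np.vander(xi, degree + 1, increasing=True)            # monomial features 1, xi, xi^2, ...

xi_grid = np.linspace(-1, 1, 200)
Z_grid = np.vander(xi_grid, degree + 1, increasing=True)
truth = np.sin(np.pi * xi_grid)                           # noiseless targets for the held-out error

# Gradient descent from w = 0 on 1/2 ||Zw - d||^2 has the closed form
#   w^k = V diag((1 - (1 - alpha s_j^2)^k) / s_j) U^T d,   with Z = U S V^T,
# so the whole family of models along the optimization path can be evaluated at once.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
alpha = 0.9 / s[0] ** 2
coeffs = U.T @ d

def iterate(k):
    filt = (1 - (1 - alpha * s ** 2) ** k) / s
    return Vt.T @ (filt * coeffs)

ks = np.unique(np.logspace(0, 12, 60).astype(np.int64))
errors = [np.mean((Z_grid @ iterate(k) - truth) ** 2) for k in ks]
best = int(np.argmin(errors))
print("best held-out error along the path:", errors[best], "at k =", ks[best])
print("held-out error at (effective) convergence:", errors[-1])
```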
The Dynamics of Momentum
Let's turn our attention back to momentum. Recall that the momentum update is
$$z^{k+1} = \beta z^k + \nabla f(w^k), \qquad w^{k+1} = w^k - \alpha z^{k+1}.$$
On the convex quadratic, the same change of basis as before, $x^k = Q^T(w^k - w^\star)$ and $y^k = Q^T z^k$, splits the iteration into $n$ independent two-dimensional recurrences,
$$y_i^{k+1} = \beta y_i^k + \lambda_i x_i^k, \qquad x_i^{k+1} = x_i^k - \alpha y_i^{k+1},$$
each governed by the $2\times 2$ matrix
$$R = \begin{pmatrix}\beta & \lambda_i \\ -\alpha\beta & 1-\alpha\lambda_i\end{pmatrix},$$
whose $k$-th power can be written in closed form in terms of its two eigenvalues $\sigma_1$ and $\sigma_2$. That formula is rather complicated, but the takeaway here is that $\max\{|\sigma_1|,|\sigma_2|\}$ plays the exact same role the individual convergence rates, $1-\alpha\lambda_i$, do in gradient descent. But instead of one geometric series, we have two coupled series, which may have real or complex values. The convergence rate is therefore the slowest of the two rates, $\max\{|\sigma_1|,|\sigma_2|\}$ 4. By plotting this out, we see there are distinct regions of the parameter space which reveal a rich taxonomy of convergence behavior [10]:
For what values of $\alpha$ and $\beta$ does momentum converge? Since we need both $\sigma_1$ and $\sigma_2$ to converge, our convergence criterion is now $\max\{|\sigma_1|,|\sigma_2|\} < 1$. The range of available step-sizes works out 5 to be
$$0 < \alpha\lambda_i < 2 + 2\beta \qquad \text{for} \qquad 0 \le \beta < 1.$$
We recover the previous result for gradient descent when $\beta = 0$. But notice an immediate boon we get. Momentum allows us to crank up the step-size by a factor of two before diverging.
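This boundary is easy to probe numerically. A small sketch (my own, with arbitrary values) builds the $2\times 2$ matrix $R$ from the recurrence above for a single eigenvalue and compares its spectral radius to $1$ on either side of $\alpha\lambda = 2 + 2\beta$:

```python
import numpy as np

def momentum_rate(alpha, beta, lam):
    """Spectral radius of the 2x2 matrix R governing momentum in one eigenspace."""
    R = np.array([[beta, lam],
                  [-alpha * beta, 1 - alpha * lam]])
    return max(abs(np.linalg.eigvals(R)))

lam, beta = 1.0, 0.6
for alpha in [0.5, 2.0, 3.1, 3.19, 3.21]:   # the boundary sits at alpha * lam = 2 + 2 * beta = 3.2
    r = momentum_rate(alpha, beta, lam)
    print(f"alpha = {alpha:>5}: rate = {r:.3f} ->", "converges" if r < 1 else "diverges")
```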
The Critical Damping Coefficient
The true magic happens, however, when we find the sweet spot of $\alpha$ and $\beta$. Let us try to first optimize over $\beta$.
Momentum admits an interesting physical interpretation [11] when $\alpha$ is small: it is a discretization of a damped harmonic oscillator. Consider a physical simulation operating in discrete time (like a video game), in which each eigenspace evolves according to the recurrence
$$y_i^{k+1} = \beta y_i^k + \lambda_i x_i^k, \qquad x_i^{k+1} = x_i^k - \alpha y_i^{k+1}.$$
We can break this update apart to see how each component affects the dynamics of the system. Here we plot, for $150$ iterates, the particle's velocity (the horizontal axis) against its position (the vertical axis), in a phase diagram.
This system is best imagined as a weight suspended on a spring. We pull the weight down by one unit, and we study the path it follows as it returns to equilibrium. In the analogy, the spring is the source of our external force $\lambda_i x_i^k$, and equilibrium is the state when both the position $x_i^k$ and the speed $y_i^k$ are $0$. The choice of $\beta$ crucially affects the rate of return to equilibrium.
The critical value of $\beta = (1 - \sqrt{\alpha\lambda_i})^2$ gives us a convergence rate (in eigenspace $i$) of
$$1 - \sqrt{\alpha\lambda_i}.$$
A square root improvement over gradient descent, $1-\alpha\lambda_i$! Alas, this only applies to the error in the $i$th eigenspace, with $\alpha$ fixed.
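A quick numerical check of the critically damped rate, using the same $2\times 2$ recurrence as above (the values here are arbitrary):

```python
import numpy as np

def momentum_rate(alpha, beta, lam):
    """Spectral radius of the per-eigenspace momentum recurrence."""
    R = np.array([[beta, lam],
                  [-alpha * beta, 1 - alpha * lam]])
    return max(abs(np.linalg.eigvals(R)))

alpha, lam = 0.04, 4.0                         # arbitrary values with alpha * lam < 1
beta_crit = (1 - np.sqrt(alpha * lam)) ** 2    # critical damping, here (1 - 0.4)^2 = 0.36

print(momentum_rate(alpha, beta_crit, lam))    # ~ 1 - sqrt(alpha * lam) = 0.6
print(momentum_rate(alpha, 0.0, lam))          # gradient descent: 1 - alpha * lam = 0.84
print(momentum_rate(alpha, 0.90, lam))         # too much momentum: complex roots, rate sqrt(0.9) ~ 0.95
```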
Optimal parameters
To get a global convergence rate, we must optimize over both $\alpha$ and $\beta$. This is a more complicated affair,6 but they work out to be
$$\alpha = \left(\frac{2}{\sqrt{\lambda_1}+\sqrt{\lambda_n}}\right)^{2}, \qquad \beta = \left(\frac{\sqrt{\lambda_n}-\sqrt{\lambda_1}}{\sqrt{\lambda_n}+\sqrt{\lambda_1}}\right)^{2}.$$
Plug these into the convergence rate, and you get
$$\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}.$$
With barely a modicum of extra effort, we have essentially square-rooted the condition number! These gains, in principle, require explicit knowledge of $\lambda_1$ and $\lambda_n$. But the formulas reveal a simple guideline. When the problem's conditioning is poor, the optimal $\alpha$ is approximately twice that of gradient descent, and the momentum term is close to $1$. So set $\beta$ as close to $1$ as you can, and then find the highest $\alpha$ which still converges. Being at the knife's edge of divergence, like in gradient descent, is a good place to be.
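Putting the formulas to work (a sketch with an invented spectrum): compute the optimal $\alpha$ and $\beta$ from $\lambda_1$ and $\lambda_n$, and compare momentum's worst-case rate with gradient descent's.

```python
import numpy as np

def momentum_rate(alpha, beta, lam):
    R = np.array([[beta, lam],
                  [-alpha * beta, 1 - alpha * lam]])
    return max(abs(np.linalg.eigvals(R)))

lam = np.array([1.0, 7.0, 22.0, 100.0])            # hypothetical eigenvalues of A
l1, ln = lam[0], lam[-1]
kappa = ln / l1

alpha = (2 / (np.sqrt(l1) + np.sqrt(ln))) ** 2     # optimal momentum parameters
beta = ((np.sqrt(ln) - np.sqrt(l1)) / (np.sqrt(ln) + np.sqrt(l1))) ** 2

worst_momentum = max(momentum_rate(alpha, beta, l) for l in lam)
worst_gd = np.max(np.abs(1 - (2 / (l1 + ln)) * lam))   # gradient descent at its own optimal step-size

print(worst_momentum, (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))   # both ~ 0.818
print(worst_gd, (kappa - 1) / (kappa + 1))                           # both ~ 0.980
```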
While the loss curve of gradient descent is graceful and monotonic, optimization with momentum displays clear oscillations. These ripples are not restricted to quadratics, and occur in all kinds of functions in practice. They are not cause for alarm, but are an indication that extra tuning of the hyperparameters is required.
Example: The Colorization Problem
Let's look at how momentum accelerates convergence with a concrete example. On a grid of pixels let $G$ be the graph with vertices as pixels, $E$ be the set of edges connecting each pixel to its four neighboring pixels, and $D$ be a small set of a few distinguished vertices. Consider the problem of minimizing
$$\tfrac{1}{2}\sum_{i \in D}(w_i - 1)^2 \;+\; \tfrac{1}{2}\sum_{(i,j)\in E}(w_i - w_j)^2.$$
The optimal solution to this problem is a vector of all $1$'s 7. An inspection of the gradient iteration reveals why it takes a long time to get there. The gradient step, for each component, is some kind of weighted average of the current value and its neighbors:
$$w_i^{k+1} = w_i^k - \alpha\sum_{j:\,(i,j)\in E}(w_i^k - w_j^k) \;-\; \alpha\,(w_i^k - 1) \quad\text{(the last term appearing only when } i \in D\text{)}.$$
This kind of local averaging is effective at smoothing out local variations in the pixels, but poor at taking advantage of global structure. The updates are akin to a drop of ink, diffusing through water. Movement towards equilibrium is made only through local corrections and so, left undisturbed, its march towards the solution is slow and laborious. Fortunately, momentum speeds things up significantly.
In vectorized form, the colorization problem is
$$\text{minimize}\quad \tfrac{1}{2}w^T L_G\, w \;+\; \tfrac{1}{2}\sum_{i\in D}(w_i - 1)^2,$$
where $L_G$ is the graph Laplacian of $G$. The Laplacian matrix, $L_G$ 8, which dominates the behavior of the optimization problem, is a valuable bridge between linear algebra and graph theory. It is a rich field of study, but one fact is pertinent to our discussion here. The conditioning of $L_G$, here defined as the ratio of its largest eigenvalue to its second smallest (the smallest eigenvalue is always $0$, with eigenvector the vector of all $1$'s), is directly connected to the connectivity of the graph.
These observations carry over to the colorization problem, and the intuition behind them should be clear. Well connected graphs allow rapid diffusion of information through the edges, while graphs with poor connectivity do not. And this principle, taken to the extreme, furnishes a class of functions so hard to optimize they reveal the limits of first order optimization.
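To see the connectivity effect concretely, here is a sketch of my own that builds the colorization quadratic, as I have written it above, for a path of 50 pixels with a single pinned vertex, and runs gradient descent and momentum with their respective optimal parameters:

```python
import numpy as np

def path_colorizer(n):
    """The colorization quadratic 1/2 w^T A w - b^T w on a path of n pixels,
    with a single distinguished vertex pinned towards 1 at one end."""
    L = np.zeros((n, n))
    for i in range(n - 1):               # Laplacian of the path graph
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    A = L.copy()
    A[0, 0] += 1.0                       # the distinguished vertex
    b = np.zeros(n)
    b[0] = 1.0
    return A, b

A, b = path_colorizer(50)
lam = np.linalg.eigvalsh(A)
l1, ln = lam[0], lam[-1]
print("condition number:", ln / l1)      # grows rapidly with the length of the path

def run(alpha, beta, steps=2000):
    """Momentum (beta = 0 gives gradient descent); distance to the all-ones solution."""
    w = np.zeros_like(b)
    z = np.zeros_like(b)
    for _ in range(steps):
        z = beta * z + (A @ w - b)
        w = w - alpha * z
    return np.max(np.abs(w - 1.0))

print("gradient descent:", run(2 / (l1 + ln), 0.0))
print("momentum:        ", run((2 / (np.sqrt(l1) + np.sqrt(ln))) ** 2,
                               ((np.sqrt(ln) - np.sqrt(l1)) / (np.sqrt(ln) + np.sqrt(l1))) ** 2))
```

On this construction the condition number is in the thousands, and momentum's error after the same number of iterations is smaller by many orders of magnitude.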
The Limits of Descent
Let's take a step back. We have, with a clever trick, improved the convergence of gradient descent by a quadratic factor with the introduction of a single auxiliary sequence. But is this the best we can do? Could we improve convergence even more with two sequences? Could one perhaps choose the $\alpha$'s and $\beta$'s intelligently and adaptively? It is tempting to ride this wave of optimism, all the way to the cube root and beyond!
Unfortunately, while improvements to the momentum algorithm do exist, they all run into a certain, critical, almost inescapable lower bound.
Adventures in Algorithmic Space
To understand the limits of what we can do, we must first formally define the algorithmic space in which we are searching. Here's one possible definition. The observation we will make is that both gradient descent and momentum can be "unrolled". Indeed, since
$$w^{k+1} = w^k - \alpha\nabla f(w^k) = w^{k-1} - \alpha\nabla f(w^{k-1}) - \alpha\nabla f(w^k) = \cdots = w^0 - \alpha\sum_{i=0}^{k}\nabla f(w^i),$$
each iterate of gradient descent is just the starting point plus a linear combination of the gradients seen so far, and the same bookkeeping (with geometric weights in $\beta$) works for momentum. In fact, all manner of first order algorithms, including the Conjugate Gradient algorithm, AdaMax, Averaged Gradient and more, can be written (though not quite so neatly) in this unrolled form. Therefore the class of algorithms for which
$$w^{k+1} = w^0 + \sum_{i}^{k}\gamma_i^k\,\nabla f(w^i) \qquad \text{for some } \gamma_i^k$$
contains momentum, gradient descent and a whole bunch of other algorithms you might dream up. This is what is assumed in Assumption 2.1.4 [5] of Nesterov. But let's push this even further, and expand this class to allow different step-sizes for different directions.
$$w^{k+1} = w^0 + \sum_{i}^{k}\Gamma_i^k\,\nabla f(w^i) \qquad \text{for some diagonal matrix } \Gamma_i^k.$$
This class of methods covers most of the popular algorithms for training neural networks, including ADAM and AdaGrad. We shall refer to this class of methods as "Linear First Order Methods", and we will show a single function all these methods ultimately fail on.
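As a concrete check that momentum fits this template (my own numerical sketch), one can unroll the $z$-buffer: after $k$ steps the coefficient attached to $\nabla f(w^i)$ is a partial geometric sum in $\beta$.

```python
import numpy as np

# A small arbitrary quadratic f(w) = 1/2 w^T A w - b^T w to supply gradients.
A = np.array([[2.0, 0.4],
              [0.4, 1.0]])
b = np.array([1.0, 0.0])
grad = lambda w: A @ w - b

alpha, beta, steps = 0.1, 0.9, 20
w0 = np.array([1.0, -1.0])

# Run momentum, recording the gradient at every iterate.
w, z, grads = w0.copy(), np.zeros(2), []
for _ in range(steps):
    g = grad(w)
    grads.append(g)
    z = beta * z + g
    w = w - alpha * z

# Unrolled form: w^k = w^0 + sum_i gamma_i^k grad f(w^i), where for momentum
# gamma_i^k = -alpha * (1 + beta + ... + beta^(k-1-i)).
w_unrolled = w0.copy()
for i, g in enumerate(grads):
    gamma = -alpha * (1 - beta ** (steps - i)) / (1 - beta)
    w_unrolled = w_unrolled + gamma * g

print(np.allclose(w, w_unrolled))   # True: momentum is a Linear First Order Method
```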
The Resisting Oracle
Earlier, when we talked about the colorizer problem, we saw that wiry graphs cause bad conditioning in our optimization problem. Taking this to its extreme, we can look at a graph consisting of a single path, a function so badly conditioned that Nesterov called a variant of it "the worst function in the world". The function follows the same structure as the colorizer problem, a convex quadratic in which each coordinate is coupled only to its immediate neighbors on the path, and we shall call it the Convex Rosenbrock, $f^n$.
The optimal solution of this problem is
$$w_i^\star = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{i},$$
and the condition number of $f^n$ approaches $\kappa$ as $n$ goes to infinity. Now observe the behavior of the momentum algorithm on this function, starting from $w^0 = 0$.
The observations made in the above diagram are true for any Linear First Order algorithm. Let us prove this. First observe that each component of the gradient depends only on the values directly before and after it: $\nabla f(w)_i$ involves only $w_{i-1}$, $w_i$ and $w_{i+1}$. Therefore the fact that we start at $0$ guarantees that a component must remain stoically at $0$ until an element either before or after it turns nonzero. And therefore, by induction, for any Linear First Order algorithm, $w_i^k = 0$ for all $i > k$. Measuring the error in the infinity norm 9, after $k$ steps we are left with
$$\|w^k - w^\star\|_\infty \;\ge\; |w_{k+1}^\star| \;=\; \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k+1}.$$
And the gap therefore closes; the convergence rate that momentum promises matches the best any Linear First Order algorithm can do. And we arrive at the disappointing conclusion that, on this problem, we cannot do better.
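The induction step can be checked mechanically. The sketch below (my own, using a generic quadratic with the same neighbor-only coupling rather than Nesterov's exact construction) runs gradient descent from $w^0 = 0$ and watches the set of nonzero coordinates grow by at most one per iteration:

```python
import numpy as np

n = 12
# A quadratic whose gradient couples each coordinate only to its immediate
# neighbors on a path: tridiagonal A, and a b that touches only the first coordinate.
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)
b[0] = 1.0

w = np.zeros(n)                       # start at w^0 = 0
alpha = 0.4
for k in range(1, 7):
    w = w - alpha * (A @ w - b)       # one gradient descent step
    print(f"after {k} steps, nonzero coordinates: {np.flatnonzero(w)}")
```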
Like many such lower bounds, this result must not be taken literally, but spiritually. It, perhaps, gives a sense of closure and finality to our investigation. But this is not the final word on first order optimization. This lower bound does not preclude the possibility, for example, of reformulating the problem to change the condition number itself! There is still much room for speedups, if you understand the right places to look.
Momentum with Stochastic Gradients
There is one final point worth addressing. All the discussion above assumes access to the true gradient, a luxury seldom afforded in modern machine learning. Computing the exact gradient requires a full pass over all the data, the cost of which can be prohibitively expensive. Instead, randomized approximations of the gradient, like minibatch sampling, are often used as a plug-in replacement for $\nabla f(w)$. We can write the approximation in two parts, the true gradient plus an error term,
$$\nabla f(w) + \mathrm{error}(w).$$
It is helpful to think of our approximate gradient as the injection of a special kind of noise into our iteration. And using the machinery developed in the previous sections, we can deal with this extra term directly. On a quadratic, the noise cleaves cleanly into a separate term 10: the iterates decompose into the usual deterministic part plus an accumulation of the errors $\epsilon^k$.
The error term, $\epsilon^k$, with its dependence on $w^k$, is a fairly hairy object. Following [10], we model it as independent 0-mean Gaussian noise. In this simplified model, the objective also breaks into two separable components, a sum of a deterministic error and a stochastic error 11, visualized here.
Note that there is a set of unfortunate tradeoffs which seem to pit the two components of error against each other. Decreasing the step-size, for example, decreases the stochastic error, but also slows down the rate of convergence. And increasing momentum, contrary to popular belief, causes the errors to compound. Despite these undesirable properties, stochastic gradient descent with momentum has still been shown to have competitive performance on neural networks. As [1] has observed, the transient phase seems to matter more than the fine-tuning phase in machine learning. And in fact, it has recently been suggested [12] that this noise is a good thing: it acts as an implicit regularizer, which, like early stopping, prevents overfitting in the fine-tuning phase of optimization.
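A small simulation in the spirit of this model (my own sketch: a toy diagonal quadratic with artificial Gaussian noise added to the gradient, rather than a real minibatch setting) makes the compounding visible; the long-run noise floor rises as $\beta$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

lam = np.array([100.0, 10.0, 1.0, 0.1])   # invented diagonal Hessian; the minimum sits at w = 0

def noisy_momentum(alpha, beta, sigma, steps):
    """Momentum where every gradient is corrupted by 0-mean Gaussian noise."""
    w = np.ones_like(lam)
    z = np.zeros_like(lam)
    tail = []
    for k in range(steps):
        g = lam * w + sigma * rng.standard_normal(lam.size)   # true gradient + error term
        z = beta * z + g
        w = w - alpha * z
        if k >= steps // 2:
            tail.append(0.5 * np.sum(lam * w ** 2))           # f(w) - f(w*)
    return np.mean(tail)                                      # long-run "noise floor" of the loss

alpha, sigma, steps = 0.005, 0.5, 20000
for beta in [0.0, 0.5, 0.9, 0.99]:
    print(f"beta = {beta:<4}  steady-state loss ~ {noisy_momentum(alpha, beta, sigma, steps):.4f}")
```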
Onwards and Downwards
The study of acceleration is seeing a small revival within the optimization community. If the ideas in this article excite you, you may wish to read [13], which fully explores the idea of momentum as the discretization of a certain differential equation. But other, less physical, interpretations exist. There is an algebraic interpretation of momentum in terms of approximating polynomials [3, 14]. Geometric interpretations are emerging [15, 16], connecting momentum to older methods, like the Ellipsoid method. And finally, there are interpretations relating momentum to duality [17], perhaps providing a clue as to how to accelerate second order and Quasi-Newton methods (for a first step, see [18]). But like the proverbial blind men feeling an elephant, momentum seems like something bigger than the sum of its parts. One day, hopefully soon, the many perspectives will converge into a satisfying whole.
Acknowledgments
I am deeply indebted to the editorial contributions of Shan Carter and Chris Olah, without which this article would be greatly impoverished. Shan Carter provided complete redesigns of many of my original interactive widgets, a visual coherence for all the figures, and invaluable optimizations to the page's performance. Chris Olah provided impeccable editorial feedback at all levels of detail and abstraction, from the structure of the content to the alignment of equations.
I am also grateful to Michael Nielsen for providing the title of this article, which really tied the article together. Marcos Ginestra provided editorial input for the earliest drafts of this article, and spiritual encouragement when I needed it the most. And my gratitude extends to my reviewers, Matt Hoffman and Anonymous Reviewer B, for their astute observations and criticism. I would like to thank Reviewer B, in particular, for pointing out two non-trivial errors in the original manuscript (discussion here). The contour plotting library for the hero visualization is the joint work of Ben Frederickson, Jeff Heer and Mike Bostock.
Many thanks for the numerous pull requests and issues filed on GitHub. Thanks in particular to Osemwaro Pedro for spotting an off-by-one error in one of the equations. And also to Dan Schmidt, who did an editing pass over the whole project, correcting numerous typographical and grammatical errors.
It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions. See [4] for a counterexample.
In Tikhonov regression we add a quadratic penalty $\eta\|w\|^2$ to the regression objective; this shrinks the $i$th eigencomponent of the solution by a factor of $\lambda_i/(\lambda_i+\eta)$, suppressing the components with the smallest eigenvalues the most.
here denotes the magnitude of the maximum eigenvalue), and occurs when the roots of the characteristic polynomial are repeated for the matrices corresponding to the extremal eigenvalues.
The above optimization problem is bounded from below by $0$, and the vector of all $1$'s achieves this.
This can be written explicitly as
$$[L_G]_{ij} = \begin{cases} \text{degree of vertex } i & i = j \\ -1 & i \neq j,\ (i,j) \text{ or } (j,i) \in E \\ 0 & \text{otherwise.} \end{cases}$$
We use the infinity norm to measure our error; similar results can be derived for the 1 and 2 norms.
and the fourth uses the fact they are uncorrelated.
References
On the importance of initialization and momentum in deep learning.[PDF] Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139—1147.
Some methods of speeding up the convergence of iteration methods[PDF] Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1—17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
Theory of gradient methods Rutishauser, H., 1959. Refined Iterative Methods for Computation of the Solution and the Eigenvalues of Self-Adjoint Boundary Value Problems, pp. 24—49. Springer. DOI: 10.1007/978-3-0348-7224-9_2
Analysis and design of optimization algorithms via integral quadratic constraints[PDF] Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57—95. SIAM.
Deep Learning, NIPS'2015 Tutorial[PDF] Hinton, G., Bengio, Y. and LeCun, Y., 2015.
Adaptive restart for accelerated gradient schemes[PDF] O’Donoghue, B. and Candes, E., 2015. Foundations of Computational Mathematics, Vol 15(3), pp. 715—732. Springer. DOI: 10.1007/s10208-013-9150-3
The Nth Power of a 2×2 Matrix.[PDF] Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
From Averaging to Acceleration, There is Only a Step-size.[PDF] Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658—695.
On the momentum term in gradient descent learning algorithms[PDF] Qian, N., 1999. Neural Networks, Vol 12(1), pp. 145—151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
Understanding deep learning requires rethinking generalization[PDF] Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights[PDF] Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510—2518.
The Zen of Gradient Descent[HTML] Hardt, M., 2013.
A geometric alternative to Nesterov’s accelerated gradient descent[PDF] Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
An optimal first order method based on optimal quadratic averaging[PDF] Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
Linear coupling: An ultimate unification of gradient and mirror descent[PDF] Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
Accelerating the cubic regularization of Newton’s method on convex problems[PDF] Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159—181. Springer. DOI: 10.1007/s10107-006-0089-x
Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as