Here’s a popular story about momentum: gradient descent is a man walking down a hill. He follows the steepest path downwards; his progress is slow, but steady. Momentum is a heavy ball rolling down the same hill. The added inertia acts both as a smoother and an accelerator, dampening oscillations and causing us to barrel through narrow valleys, small humps and local minima.
This standard story isn’t wrong, but it fails to explain many important behaviors of momentum. In fact, momentum can be understood far more precisely if we study it on the right model.
One nice model is the convex quadratic. This model is rich enough to reproduce momentum’s local dynamics in real problems, and yet simple enough to be understood in closed form. This balance gives us powerful traction for understanding this algorithm.
We begin with gradient descent. The algorithm has many virtues, but speed is not one of them. It is simple — when optimizing a smooth function
$f$
, we take small steps in the direction of steepest descent:
$w^{k+1} = w^k - \alpha\nabla f(w^k).$
For a sufficiently small stepsize, gradient descent makes a monotonic improvement at every iteration. It always converges, albeit to a local minimum. And under a few weak curvature conditions it can even get there at an exponential rate.
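In code, the update is a one-liner. Here is a minimal sketch (the quadratic and its parameters are illustrative, not from the article):

```python
import numpy as np

def gradient_descent(grad, w0, alpha, steps):
    """Plain gradient descent: w <- w - alpha * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Example: minimize f(w) = 0.5 w^T A w - b^T w, whose minimum is A^{-1} b.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
w = gradient_descent(lambda w: A @ w - b, w0=[0.0, 0.0], alpha=0.1, steps=500)
print(np.allclose(w, np.linalg.solve(A, b)))  # converged to the minimum
```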
But the exponential decrease, though appealing in theory, can often be infuriatingly small. Things often begin quite well — with an impressive, nearly immediate decrease in the loss. But as the iterations progress, things start to slow down. You start to get a nagging feeling you’re not making as much progress as you should be. What has gone wrong?
The problem could be the optimizer’s old nemesis, pathological curvature. Pathological curvature is, simply put, regions of
$f$
which aren’t scaled properly. The landscapes are often described as valleys, trenches, canals and ravines. In these unfortunate regions, gradient descent fumbles.
Momentum proposes the following tweak to gradient descent. We give gradient descent a short-term memory:
$\begin{aligned} z^{k+1}&=\beta z^{k}+\nabla f(w^{k})\\[0.4em] w^{k+1}&=w^{k}-\alpha z^{k+1} \end{aligned}$
The change is innocent, and costs almost nothing. When
$\beta = 0$
, we recover gradient descent. But for
$\beta = 0.99$
(sometimes
$0.999$
, if things are really bad), this appears to be the boost we need.
Optimizers call this minor miracle “acceleration”.
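The tweak is just as small in code. A minimal sketch (the ill-conditioned quadratic and the parameter values here are illustrative):

```python
import numpy as np

def momentum(grad, w0, alpha, beta, steps):
    """Gradient descent with momentum:
       z <- beta * z + grad(w);  w <- w - alpha * z."""
    w = np.asarray(w0, dtype=float)
    z = np.zeros_like(w)
    for _ in range(steps):
        z = beta * z + grad(w)
        w = w - alpha * z
    return w

# An ill-conditioned quadratic (condition number 100). With beta = 0 this
# is plain gradient descent; beta close to 1 speeds up the slow directions.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
w = momentum(lambda w: A @ w - b, w0=[0.0, 0.0], alpha=0.01, beta=0.9, steps=500)
print(np.allclose(w, np.linalg.solve(A, b), atol=1e-6))
```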
The new algorithm may seem at first glance like a cheap hack. A simple trick to get around gradient descent’s more aberrant behavior — a smoother for oscillations between steep canyons. But the truth, if anything, is the other way round. It is gradient descent which is the hack. First, momentum gives up to a quadratic speedup on many functions.
But there’s more. A lower bound, courtesy of Nesterov [5], gives us an assurance that momentum is, in a certain very narrow and technical sense, optimal.
First Steps: Gradient Descent
We begin by studying gradient descent on the simplest model possible which isn’t trivial — the convex quadratic,
$f(w) = \tfrac{1}{2}w^TAw - b^Tw, \qquad w \in \mathbf{R}^n.$
Assume
$A$
is symmetric and invertible, then the optimal solution
$w^{\star}$
occurs at
$w^{\star} = A^{-1}b.$
Simple as this model may be, it is rich enough to approximate many functions (think of
$A$
as your favorite model of curvature — the Hessian, Fisher information matrix [6], etc.), and captures all the key features of pathological curvature.
Here’s how it goes. Since
$\nabla f(w)=Aw - b$
, the iterates are
$w^{k+1}=w^{k}- \alpha (Aw^{k} - b).$
Here’s the trick. There is a very natural space to view gradient descent where all the dimensions act independently — the eigenvectors of
$A$
.
Every symmetric matrix
$A$
has an eigenvalue decomposition
$A=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^{T},\qquad Q = [q_1,\ldots,q_n],$
and, as per convention, we will assume that the
$\lambda_i$
’s are sorted, from smallest
$\lambda_1$
to biggest
$\lambda_n$
.
If we perform a change of basis,
$x^{k} = Q^T(w^{k} - w^\star)$
, the iterations break apart, becoming:
$\begin{aligned} x_{i}^{k+1} & =x_{i}^{k}-\alpha \lambda_i x_{i}^{k} \\[0.4em] &= (1-\alpha\lambda_i)x^k_i=(1-\alpha \lambda_i)^{k+1}x^0_i \end{aligned}$
Moving back to our original space
$w$
, we can see that
$w^k - w^\star = Qx^k=\sum_i^n x^0_i(1-\alpha\lambda_i)^k q_i$
and there we have it — gradient descent in closed form.
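We can sanity-check this closed form numerically: the iterative and closed-form expressions agree exactly, for any stepsize. A sketch (the random matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)          # a symmetric, positive definite A
b = rng.standard_normal(3)
w_star = np.linalg.solve(A, b)
alpha, k = 0.05, 40

# Iterative gradient descent from w^0 = 0.
w = np.zeros(3)
for _ in range(k):
    w = w - alpha * (A @ w - b)

# Closed form: w^k - w^* = sum_i x_i^0 (1 - alpha lambda_i)^k q_i,
# where x^0 = Q^T (w^0 - w^*).
lam, Q = np.linalg.eigh(A)
x0 = Q.T @ (np.zeros(3) - w_star)
w_closed = w_star + Q @ ((1 - alpha * lam) ** k * x0)

print(np.allclose(w, w_closed))  # identical, up to floating point
```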
Decomposing the Error
The above equation admits a simple interpretation. Each element of
$x^0$
is the component of the error in the initial guess in the
$Q$
-basis. There are
$n$
such errors, and each of these errors follows its own, solitary path to the minimum, decreasing exponentially with a compounding rate of
$1-\alpha\lambda_i$
. The closer that number is to
$1$
, the slower it converges.
For most stepsizes, the eigenvectors with largest eigenvalues converge the fastest. This triggers an explosion of progress in the first few iterations, before things slow down as the smaller eigenvectors’ struggles are revealed. We can see this by writing the contributions of each eigenspace’s error to the loss:
$f(w^{k})-f(w^{\star})=\tfrac{1}{2}\sum_i(1-\alpha\lambda_{i})^{2k}\lambda_{i}[x_{i}^{0}]^2$
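A quick numeric check of this decomposition (a sketch; the eigenvalues and stepsize are arbitrary choices):

```python
import numpy as np

A = np.diag([1.0, 10.0])          # eigenvalues lambda_1 = 1, lambda_2 = 10
b = np.array([2.0, 2.0])
w_star = np.linalg.solve(A, b)
alpha, k = 0.09, 25

w = np.zeros(2)
for _ in range(k):
    w = w - alpha * (A @ w - b)

lam, Q = np.linalg.eigh(A)
x0 = Q.T @ (np.zeros(2) - w_star)

def f(w):
    return 0.5 * w @ A @ w - b @ w

# Per-eigenspace contributions to the loss gap:
# f(w^k) - f(w^*) = (1/2) sum_i lambda_i (1 - alpha lambda_i)^{2k} [x_i^0]^2
contributions = 0.5 * lam * (1 - alpha * lam) ** (2 * k) * x0**2
print(np.isclose(f(w) - f(w_star), contributions.sum()))
```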
Choosing A Stepsize
The above analysis gives us immediate guidance as to how to set a stepsize
$\alpha$
. In order to converge, each
$|1-\alpha \lambda_i|$
must be strictly less than 1. All workable stepsizes, therefore, fall in the interval
$0<\alpha\lambda_i<2.$
The overall convergence rate is determined by the slowest error component, which must be either
$\lambda_1$
or
$\lambda_n$
:
$\begin{aligned}\text{rate}(\alpha) & ~=~ \max_{i}\left|1-\alpha\lambda_{i}\right|\\[0.9em] & ~=~ \max\left\{\left|1-\alpha\lambda_{1}\right|,~ \left|1-\alpha\lambda_{n}\right|\right\} \end{aligned}$
This overall rate is minimized when the rates for
$\lambda_1$
and
$\lambda_n$
are the same:
$\begin{aligned} \text{optimal }\alpha ~=~{\mathop{\text{argmin}}\limits_\alpha} ~\text{rate}(\alpha) & ~=~\frac{2}{\lambda_{1}+\lambda_{n}}\\[1.4em] \text{optimal rate} ~=~{\min_\alpha} ~\text{rate}(\alpha) & ~=~\frac{\lambda_{n}/\lambda_{1}-1}{\lambda_{n}/\lambda_{1}+1} \end{aligned}$
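These two formulas are easy to verify numerically. A sketch, with illustrative eigenvalues:

```python
import numpy as np

def gd_rate(alpha, lam_min, lam_max):
    """Worst-case per-step convergence rate of gradient descent."""
    return max(abs(1 - alpha * lam_min), abs(1 - alpha * lam_max))

lam1, lamn = 1.0, 100.0                      # illustrative extreme eigenvalues
alpha_opt = 2 / (lam1 + lamn)
rate_opt = (lamn / lam1 - 1) / (lamn / lam1 + 1)

print(np.isclose(gd_rate(alpha_opt, lam1, lamn), rate_opt))  # formulas agree
print(gd_rate(0.9 * alpha_opt, lam1, lamn) > rate_opt)       # nearby stepsizes
print(gd_rate(1.1 * alpha_opt, lam1, lamn) > rate_opt)       # are strictly worse
```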
Notice the ratio
$\lambda_n/\lambda_1$
determines the convergence rate of the problem. In fact, this ratio appears often enough that we give it a name, and a symbol — the condition number.
$\text{condition number} := \kappa :=\frac{\lambda_n}{\lambda_1}$
The condition number means many things. It is a measure of how close to singular a matrix is, and of how robust
$A^{-1}b$
is to perturbations in
$b$
. And, in this context, the condition number gives us a measure of how poorly gradient descent will perform. A ratio of
$\kappa = 1$
is ideal, giving convergence in one step (of course, the function is trivial). Unfortunately, the larger the ratio, the slower gradient descent will be.
Example: Polynomial Regression
The above analysis reveals an insight: all errors are not made equal. Indeed, there are different kinds of errors,
$n$
to be exact, one for each of the eigenvectors of
$A$
. And gradient descent is better at correcting some kinds of errors than others. But what do the eigenvectors of
$A$
mean? Surprisingly, in many applications they admit a very concrete meaning.
Let’s see how this plays out in polynomial regression. Given 1D data,
$\xi_i$
, our problem is to fit the model
$\text{model}(\xi)=w_{1}p_{1}(\xi)+\cdots+w_{n}p_{n}(\xi)\qquad p_{i}=\xi\mapsto\xi^{i-1}$
to our observations,
$d_i$
. Though nonlinear in the input
$\xi$
, the model is linear in the weights.
Because of the linearity, we can fit this model to our data
$\xi_i$
using linear regression on the model mismatch:
$\text{minimize}_w \qquad\tfrac{1}{2}\sum_i (\text{model}(\xi_{i})-d_{i})^{2} ~~=~~ \tfrac{1}{2}\|Zw - d\|^2$
where
$Z=\left(\begin{array}{ccccc} 1 & \xi_{1} & \xi_{1}^{2} & \ldots & \xi_{1}^{n-1}\\ 1 & \xi_{2} & \xi_{2}^{2} & \ldots & \xi_{2}^{n-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & \xi_{m} & \xi_{m}^{2} & \ldots & \xi_{m}^{n-1} \end{array}\right).$
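In code, $Z$ is a Vandermonde matrix. Here is a sketch of the fit (the data and target are toy choices):

```python
import numpy as np

xi = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # 1D data points
d = xi**3 - xi + 0.5                          # toy observations (a cubic)
n = 4                                         # fit a degree-3 polynomial

Z = np.vander(xi, N=n, increasing=True)       # columns: 1, xi, xi^2, xi^3
w, *_ = np.linalg.lstsq(Z, d, rcond=None)     # least-squares fit

print(np.round(w, 6))                         # recovers [0.5, -1, 0, 1]
```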
The path of convergence, as we know, is elucidated when we view the iterates in the space of
$Q$
(the eigenvectors of
$Z^T Z$
). So let’s recast our regression problem in the basis of
$Q$
. First, we do a change of basis, by rotating
$w$
into
$Qw$
, and counter-rotating our feature maps
$p$
into eigenspace,
$\bar{p}$
. We can now conceptualize the same regression as one over a different polynomial basis, with the model
$\text{model}(\xi)~=~x_{1}\bar{p}_{1}(\xi)~+~\cdots~+~x_{n}\bar{p}_{n}(\xi)\qquad \bar{p}_{i}=\sum q_{ij}p_j.$
This model is identical to the old one. But these new features
$\bar{p}$
(which I call “eigenfeatures”) and weights have the pleasing property that each coordinate acts independently of the others. Now our optimization problem breaks down, really, into
$n$
small 1D optimization problems.
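We can sketch the rotation into eigenfeatures numerically. In the rotated basis the Gram matrix of the features is diagonal, which is exactly what makes the coordinates decouple (toy data; the names are illustrative):

```python
import numpy as np

xi = np.linspace(-1, 1, 6)                 # toy 1D data
Z = np.vander(xi, N=3, increasing=True)    # features p: 1, xi, xi^2
lam, Q = np.linalg.eigh(Z.T @ Z)           # eigenvectors of Z^T Z

# Eigenfeatures: column i of Z @ Q evaluates the rotated polynomial
# p-bar_i = sum_j q_{ji} p_j at the data points.
Zbar = Z @ Q

# In this basis the Gram matrix is diagonal: the weights decouple
# into independent 1D problems.
print(np.allclose(Zbar.T @ Zbar, np.diag(lam)))
```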
Acknowledgments
I am deeply indebted to the editorial contributions of Shan Carter and Chris Olah, without which this article would be greatly impoverished. Shan Carter provided complete redesigns of many of my original interactive widgets, a visual coherence for all the figures, and invaluable optimizations to the page’s performance. Chris Olah provided impeccable editorial feedback at all levels of detail and abstraction – from the structure of the content, to the alignment of equations.
I am also grateful to Michael Nielsen for providing the title of this article, which really tied the article together. Marcos Ginestra provided editorial input for the earliest drafts of this article, and spiritual encouragement when I needed it the most. And my gratitude extends to my reviewers, Matt Hoffman and Anonymous Reviewer B for their astute observations and criticism. I would like to thank Reviewer B, in particular, for pointing out two nontrivial errors in the original manuscript (discussion here). The contour plotting library for the hero visualization is the joint work of Ben Frederickson, Jeff Heer and Mike Bostock.
Many thanks for the numerous pull requests and issues filed on GitHub. Thanks in particular to Osemwaro Pedro for spotting an off-by-one error in one of the equations. And also to Dan Schmidt, who did an editing pass over the whole project, correcting numerous typographical and grammatical errors.
Discussion and Review
Reviewer A – Matt Hoffman
Reviewer B – Anonymous
Discussion with User derifatives
Footnotes
It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions. See [4] for a counterexample.

In Tikhonov Regression we add a quadratic penalty to the regression, minimizing
$\text{minimize}\qquad\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}~=~\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Z^{T}d)^{T}w$

$Z^{T}Z=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^T$

$(Z^{T}Z+\eta I)^{-1}(Z^{T}d)=Q\ \text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^T(Z^{T}d)$

$\text{Tikhonov Regularized }\ \lambda_i = \frac{1}{\lambda_{i}+\eta}=\frac{1}{\lambda_{i}}\left(1-\left(1+\lambda_{i}/\eta\right)^{-1}\right).$

$\text{Gradient Descent Regularized }\ \lambda_i = \frac{1}{\lambda_i} \left( 1-\left(1-\alpha\lambda_{i}\right)^{k} \right)$
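Both closed forms above can be checked numerically; the second is just a geometric series in $1-\alpha\lambda_i$. A sketch with arbitrary values:

```python
import numpy as np

lam, eta = 2.5, 0.7        # an eigenvalue and a Tikhonov penalty (arbitrary)
alpha, k = 0.1, 12         # a stepsize and iteration count (arbitrary)

# Tikhonov: 1/(lam + eta) as a damped inverse eigenvalue.
ok_tik = np.isclose(1 / (lam + eta),
                    (1 / lam) * (1 - (1 + lam / eta) ** -1))

# k steps of gradient descent: (1/lam)(1 - (1 - alpha*lam)^k) equals the
# geometric series alpha * sum_{j<k} (1 - alpha*lam)^j.
ok_gd = np.isclose((1 / lam) * (1 - (1 - alpha * lam) ** k),
                   alpha * sum((1 - alpha * lam) ** j for j in range(k)))

print(ok_tik, ok_gd)
```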

This is true as we can write the updates in matrix form as
$\left(\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)$
which implies, by inverting the matrix on the left,
$\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ -\alpha\beta & 1-\alpha\lambda_{i} \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)=R^{k+1}\left(\begin{array}{c} y_{i}^{0}\\ x_{i}^{0} \end{array}\right)$
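A quick numeric check that the matrix $R$ reproduces the per-coordinate momentum recurrence (illustrative parameters):

```python
import numpy as np

alpha, beta, lam = 0.02, 0.9, 3.0          # illustrative parameters
R = np.array([[beta,          lam],
              [-alpha * beta, 1 - alpha * lam]])

# Direct per-coordinate momentum recurrence:
#   y^{k+1} = beta y^k + lam x^k;   x^{k+1} = x^k - alpha y^{k+1}
y, x = 0.0, 1.0
k = 10
for _ in range(k):
    y, x = beta * y + lam * x, x - alpha * (beta * y + lam * x)

v = np.linalg.matrix_power(R, k) @ np.array([0.0, 1.0])
print(np.allclose([y, x], v))              # R^k reproduces the recurrence
```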

We can write out the convergence rates explicitly. The eigenvalues are
$\begin{aligned} \sigma_{1} & =\frac{1}{2}\left(1-\alpha\lambda+\beta+\sqrt{(1-\alpha\lambda+\beta)^{2}-4\beta}\right)\\[0.6em] \sigma_{2} & =\frac{1}{2}\left(1-\alpha\lambda+\beta-\sqrt{(1-\alpha\lambda+\beta)^{2}-4\beta}\right) \end{aligned}$
When
$(1-\alpha\lambda+\beta)^{2}-4\beta<0$
, the eigenvalues are complex, and both have magnitude
$\begin{aligned} |\sigma_{1}|=|\sigma_{2}| & =\tfrac{1}{2}\sqrt{(1-\alpha\lambda+\beta)^{2}+4\beta-(1-\alpha\lambda+\beta)^{2}}=\sqrt{\beta} \end{aligned}$
independent of
$\alpha\lambda$
. When the eigenvalues are real, the convergence rate is
$\max\{|\sigma_{1}|,|\sigma_{2}|\}=\tfrac{1}{2}\max\left\{ \left|1-\alpha\lambda_{i}+\beta\pm\sqrt{(1-\alpha\lambda_{i}+\beta)^{2}-4\beta}\right|\right\}$
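The $\sqrt{\beta}$ magnitude in the complex regime is easy to confirm numerically (illustrative parameters):

```python
import numpy as np

alpha, beta, lam = 0.05, 0.81, 2.0         # chosen so the discriminant < 0
assert (1 - alpha * lam + beta) ** 2 - 4 * beta < 0

R = np.array([[beta,          lam],
              [-alpha * beta, 1 - alpha * lam]])
sigma = np.linalg.eigvals(R)               # a complex conjugate pair

# Both eigenvalues have magnitude sqrt(beta), independent of alpha * lam.
print(np.allclose(np.abs(sigma), np.sqrt(beta)))
```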
This can be derived by reducing the inequalities for all $4+1$ cases in the explicit form of the convergence rate above.
We must optimize over
$\min_{\alpha,\beta}\max\left\{ \bigg\|\left(\begin{array}{cc} \beta & \lambda_{1}\\ -\alpha\beta & 1-\alpha\lambda_{1} \end{array}\right)\bigg\|,\ldots,\bigg\|\left(\begin{array}{cc} \beta & \lambda_{n}\\ -\alpha\beta & 1-\alpha\lambda_{n} \end{array}\right)\bigg\|\right\}.$
where
$\|\cdot\|$
is a measure of the size of the matrix.
The above optimization problem is bounded from below by
$0$
$1$
This can be written explicitly as
$[L_{G}]_{ij}=\begin{cases} \text{degree of vertex }i & i=j\\ -1 & i\neq j,\ (i,j)\text{ or }(j,i)\in E\\ 0 & \text{otherwise} \end{cases}$
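A sketch of this construction for a small graph (the example graph is mine, not the article’s):

```python
import numpy as np

def graph_laplacian(n, edges):
    """L_G: vertex degrees on the diagonal, -1 for each edge (i, j)."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1
        L[j, i] -= 1
        L[i, i] += 1
        L[j, j] += 1
    return L

# A path graph on 4 vertices: 0 - 1 - 2 - 3.
L = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
print(L.astype(int))
print(np.allclose(L @ np.ones(4), 0))   # rows sum to zero
```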
We use the infinity norm to measure our error; similar results can be derived for the 1 and 2 norms.

The momentum iterations with an inexact gradient are
$\begin{aligned} z^{k+1}&=\beta z^{k}+ \nabla f(w^{k}) + \text{error}(w^{k})\\[0.4em] w^{k+1}&=w^{k}-\alpha z^{k+1}. \end{aligned}$
In the eigenbasis, this becomes
$\left(\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\right)=\left(\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\right)\left(\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\right)+\left(\begin{array}{c} \epsilon_{i}^{k}\\ 0 \end{array}\right)$
a perturbed version of the
$2 \times 2$
recurrence above.
On the 1D function
$f(x)=\frac{\lambda}{2}x^{2}$
, the expected loss under noisy gradients unrolls as
$\begin{aligned} \mathbf{E}\,f(x^{k})&=\frac{\lambda}{2}\mathbf{E}\left[(x^{k})^{2}\right]\\ &=\frac{\lambda}{2}\,\mathbf{E}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)+\sum_{i=1}^{k}\epsilon^{i}\,e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\ &=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda}{2}\,\mathbf{E}\left(\sum_{i=1}^{k}\epsilon^{i}\,e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\ &=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda\,\mathbf{E}[\epsilon^{2}]}{2}\sum_{i=1}^{k}\gamma_{i}^{2}, \qquad \gamma_i = e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right) \end{aligned}$
using the fact that
$\mathbf{E}\,\epsilon^k = 0$
and that the errors are uncorrelated across iterations.
References
[1] On the importance of initialization and momentum in deep learning [PDF]
Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139–1147.
[2] Some methods of speeding up the convergence of iteration methods [PDF]
Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1–17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
[3] Theory of gradient methods
Rutishauser, H., 1959. Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pp. 24–49. Springer. DOI: 10.1007/978-3-0348-7224-9_2
[4] Analysis and design of optimization algorithms via integral quadratic constraints [PDF]
Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57–95. SIAM.
[5] Introductory lectures on convex optimization: A basic course
Nesterov, Y., 2013. Vol 87. Springer Science & Business Media. DOI: 10.1007/978-1-4419-8853-9
[6] Natural gradient works efficiently in learning
Amari, S., 1998. Neural Computation, Vol 10(2), pp. 251–276. MIT Press. DOI: 10.1162/089976698300017746
[7] Deep Learning, NIPS 2015 Tutorial [PDF]
Hinton, G., Bengio, Y. and LeCun, Y., 2015.
[8] Adaptive restart for accelerated gradient schemes [PDF]
O’Donoghue, B. and Candes, E., 2015. Foundations of Computational Mathematics, Vol 15(3), pp. 715–732. Springer. DOI: 10.1007/s10208-013-9150-3
[9] The nth Power of a 2×2 Matrix [PDF]
Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
[10] From Averaging to Acceleration, There is Only a Step-size [PDF]
Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658–695.
[11] On the momentum term in gradient descent learning algorithms [PDF]
Qian, N., 1999. Neural Networks, Vol 12(1), pp. 145–151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
[12] Understanding deep learning requires rethinking generalization [PDF]
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
[13] A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights [PDF]
Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510–2518.
[14] The Zen of Gradient Descent [HTML]
Hardt, M., 2013.
[15] A geometric alternative to Nesterov’s accelerated gradient descent [PDF]
Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
[16] An optimal first order method based on optimal quadratic averaging [PDF]
Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
[17] Linear coupling: An ultimate unification of gradient and mirror descent [PDF]
Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
[18] Accelerating the cubic regularization of Newton’s method on convex problems [PDF]
Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159–181. Springer. DOI: 10.1007/s10107-006-0089-x
Updates and Corrections
View all changes to this article since it was first published. If you see a mistake or want to suggest a change, please create an issue on GitHub.
Citations and Reuse
Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
For attribution in academic contexts, please cite this work as
Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
BibTeX citation
@article{goh2017why, author = {Goh, Gabriel}, title = {Why Momentum Really Works}, journal = {Distill}, year = {2017}, url = {http://distill.pub/2017/momentum}, doi = {10.23915/distill.00006} }