How Does Stochastic Gradient Descent Find the Global Minimum?

Stochastic Gradient Descent is known for the randomness it introduces while iterating towards the parameter set that best fits the training data. But is speed the only thing this randomness buys us in the search for a good set of parameters?

Actually, the mathematics reveals one more advantage of SGD: the noise it injects effectively smooths a complicated, continuously differentiable cost function that has several local minima.

Convex-optimisation proofs treat the mathematics more rigorously; here we will build an intuitive feel for why and where this happens.

Modern neural networks model increasingly abstract structure, so their loss functions are highly non-convex and exhibit every kind of uncomfortable feature: peaks, saddle points and local minima. Finding the global minimum is the main concern here, as that is what takes us towards the optimal parameters.

True (full-batch) gradient descent follows this update rule: x(t+1) = x(t) − η ∇f(x, θ), where η is the learning rate.

As you can see, the step is exactly the negative of the loss function's gradient (−η ∇f(x, θ)), so the iterate has a high probability of settling into one of those saddle points or local minima.
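To make this concrete, here is a minimal sketch of the plain update on a made-up one-dimensional loss. The function f(x) = x² + 1.5·sin(3x), the learning rate and the starting point are all illustrative assumptions, not something from a real network; the point is only that, started inside a shallow valley, the deterministic update settles into the local minimum and stays there.

```python
import numpy as np

# Toy non-convex "loss": a parabola with a ripple, giving a shallow
# local minimum near x ~ 1.36 and the global minimum near x ~ -0.46.
def f(x):
    return x**2 + 1.5 * np.sin(3 * x)

def grad_f(x):
    return 2 * x + 4.5 * np.cos(3 * x)

# Plain (full) gradient descent: x <- x - eta * grad_f(x)
eta = 0.05          # learning rate (illustrative choice)
x = 2.0             # start inside the basin of the local minimum
for _ in range(500):
    x = x - eta * grad_f(x)

print(f"plain GD ends at x = {x:.3f}, f(x) = {f(x):.3f}")
# Ends near x ~ 1.36 (the local minimum), never reaching x ~ -0.46.
```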

In SGD, the step is not truly along the gradient (which is why the 'Gradient' in SGD can feel like a misnomer), but along a perturbed direction: x(t+1) = x(t) − η (∇f(x, θ) + w(t)), where w(t) ~ W(x) is a random distribution arising from sampling.

The additional w(t) term makes the algorithm take a noticeably different course from the one the exact gradient would dictate.
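Here is the same toy loss with the perturbed update. Drawing w(t) from a Gaussian is only a stand-in for the sampling noise of real minibatches, and the noise scale, step count and number of runs are again illustrative assumptions.

```python
import numpy as np

def grad_f(x):
    return 2 * x + 4.5 * np.cos(3 * x)   # gradient of x**2 + 1.5*sin(3x)

eta, sigma, steps = 0.05, 4.0, 3000      # illustrative hyper-parameters
rng = np.random.default_rng(0)

runs, escaped = 10, 0
for _ in range(runs):
    x = 2.0                               # same start as plain GD above
    for _ in range(steps):
        w = rng.normal(0.0, sigma)        # stand-in for the sampling noise w(t)
        x = x - eta * (grad_f(x) + w)     # x(t+1) = x(t) - eta*(grad + w)
    # the barrier separating the two valleys sits near x ~ 0.62
    if x < 0.62:
        escaped += 1

print(f"{escaped}/{runs} noisy runs ended in the global valley (x ~ -0.46)")
```

With this much noise, most runs typically hop the barrier that plain gradient descent never crosses from the same starting point; the exact fraction depends on the noise scale, the learning rate and the number of steps.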

Here v(t) = ∇f(x, θ) + w(t), so the step taken is −η v(t). Following the pure gradient of f(x, θ) leads you towards the nearest local minimum x0. The intuition behind w(t) is to add some form of noise, i.e. a completely random distribution of samples W(x).

This effectively convolves the loss function with the noise distribution, producing a smoother landscape: the shallower valleys (the local minima here) become insignificant, so the net gradient points towards the global minimum most of the time.

So the perturbed step −η v(t) pulls the iterate away from the local minimum altogether. The noisiness in SGD therefore actually helps it escape the troublesome regions of the loss function.
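The convolution claim can be checked directly on the same toy loss. In the sketch below, the smoothed loss E[f(x + w)] is estimated by Monte Carlo with an assumed Gaussian noise width, and its gradient (which equals E[∇f(x + w)] for this smooth f) is compared with the true gradient at the local minimum the plain update got stuck in.

```python
import numpy as np

def grad_f(x):
    return 2 * x + 4.5 * np.cos(3 * x)   # gradient of x**2 + 1.5*sin(3x)

# Gradient of the smoothed loss E[f(x + w)], w ~ N(0, sigma^2),
# estimated by Monte Carlo as the average of grad_f(x + w).
def smoothed_grad(x, sigma=0.5, n_samples=200_000, seed=0):
    w = np.random.default_rng(seed).normal(0.0, sigma, n_samples)
    return grad_f(x + w).mean()

x_local = 1.358   # the shallow local minimum the plain update settles into

print(f"true gradient at local minimum     : {grad_f(x_local):+.3f}")
print(f"smoothed gradient at local minimum : {smoothed_grad(x_local):+.3f}")
# The true gradient is ~0 (it is a minimum of f), but the smoothed gradient
# is clearly positive: the smoothed landscape still slopes down towards the
# global minimum there, so the shallow valley has been washed out.
```

The smoothing width plays the role of the SGD noise scale: the larger the noise, the more aggressively shallow valleys are flattened, at the cost of blurring the fine detail of the landscape around the global minimum.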

So here are some rules of thumb for when to use SGD and when to avoid it (the cost side of the argument is sketched below):

- Prefer SGD (or minibatch SGD) when the training set is large and redundant: each update looks at only a handful of examples, so it is far cheaper per step than full-batch gradient descent.
- Prefer SGD when the loss is highly non-convex, since the sampling noise helps the iterate escape shallow local minima and saddle points.
- Plain full-batch gradient descent remains a reasonable choice when the dataset is small and the loss is smooth and convex, where its stable, noise-free steps converge cleanly and SGD's noise only slows things down.
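To make the cost argument concrete, here is a minimal sketch on a synthetic least-squares problem; the data, the problem sizes and the batch size are illustrative assumptions chosen only to show the per-update cost gap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: loss(theta) = mean((X @ theta - y)**2) / 2
n_examples, n_features, batch_size = 100_000, 50, 32
X = rng.normal(size=(n_examples, n_features))
theta_true = rng.normal(size=n_features)
y = X @ theta_true + 0.1 * rng.normal(size=n_examples)

def full_gradient(theta):
    # touches all n_examples rows: O(n_examples * n_features) per update
    return X.T @ (X @ theta - y) / n_examples

def minibatch_gradient(theta):
    # touches only batch_size rows: O(batch_size * n_features) per update
    idx = rng.integers(0, n_examples, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / batch_size

theta = np.zeros(n_features)
eta = 0.01
for _ in range(2000):                     # 2000 cheap, noisy updates
    theta = theta - eta * minibatch_gradient(theta)

print("distance to true parameters:", np.linalg.norm(theta - theta_true))
# One full-batch update costs roughly as much as ~3000 minibatch updates here,
# yet the noisy minibatch steps already drive theta close to theta_true.
```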

So, I hope this gave you a better intuition into the dynamics behind the famed Stochastic Gradient Descent and its usefulness: it is computationally efficient, and its noise gives it a much better chance of escaping poor local minima while training neural networks.
