Let us suppose we have rounded bucket with a ball inside it trying to find the bottom of the bucket. This is called optimization i.e. ball is trying to optimize it`s position by finding the lowest point in the bucket.
Let us suppose ball has two options, either to go left or to go right. It has one goal i.e. to reach the lowest point of the bucket. Now ball has to take the decision whether to go left or to right based on the available information. The only information ball has, the “SLOPE” inside the bucket at its current position.
As we can see, if slope is negative (negative gradient) ball should move to right and if slope is positive (positive gradient) ball should move to left. So, slope is the only information which ball needs to reach the minimum or the most optimized position.
Steps to reach the optimized position:
- Calculate the slope at the current position
- If slope is negative move to right
- If slope is positive move to left
- Repeat step 2 and 3 until slope is equal to zero
The above mentioned steps are kind of simplification of the complete gradient descent process. Another important question the ball has to decide upon and i.e. how much ball should move at each step. So steeper the slope farther the ball is away from bottom. Now ball has to leverage this new information. Ball has to decide on the length of the step, so that it reaches to the bottom of bucket with minimum number of steps.
Learning Rate: The size of these steps is called the learning rate. With a high learning rate we can cover more ground each step, but we risk overshooting the lowest point since the slope of the bucket is constantly changing. With a very low learning rate, we can confidently move in the direction of the negative gradient since we are recalculating it so frequently. A low learning rate is more precise, but calculating the gradient is time-consuming, so it will take us a very long time to get to the bottom.
There are number of problems with non-optimized learning rate.
Learning rate is too big: As our decision of learning rate depends upon steepness of the curve. If curve is too steep and accordingly our “Learning rate” is too big then we would overshoot our target optimized position. At times, this vicious cycle of overshooting leads to problem of divergence. So because of large learning rate we diverge rather than converging to minimum position.
Learning rate is too small: If we keep our learning rate too small, we might overcome the problem of overshooting but process of very small learning rate could be too time-consuming.
Problem of Local Minima and Global Minimum:
Sometimes, we might be able to find the minima by reaching a point where gradient or slope is equal to zero which was our primary condition to decide upon optimized position in the bucket.
But as we can see we might get stuck into local minima, which is a minimum value but it might just be local phenomena but not the most optimized position.
So, we will try to understand this problem of local minima and global minima in our next topic of Cost function.