Gradient Descent: The gradient descent algorithm we have looked at so far assumed that the ball simply has to reach the lowest point in the bucket. But what if the bucket has more than one minimum, or more specifically, several local minima and one global minimum?
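To make the problem concrete, here is a minimal sketch in plain Python. The function `f`, its derivative `grad_f`, the learning rate, and the starting points are all hypothetical choices made for illustration; they stand in for a "bucket" with two dips, one deeper than the other.

```python
def f(x):
    # Hypothetical "bucket" with two dips: a shallow one near x = 1.13
    # and a deeper one near x = -1.30.
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Derivative of f, worked out by hand.
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, steps=1000):
    # Plain gradient descent: repeatedly step against the slope.
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


right = gradient_descent(2.0)   # ball dropped on the right side
left = gradient_descent(-2.0)   # ball dropped on the left side

print("start at  2.0 -> x =", round(right, 3), " f(x) =", round(f(right), 3))
print("start at -2.0 -> x =", round(left, 3), " f(x) =", round(f(left), 3))
```

With these settings the ball dropped at 2.0 settles in the shallow dip (around x = 1.13), while the ball dropped at -2.0 reaches the deeper one (around x = -1.30). Plain gradient descent only ever finds the minimum nearest to where it starts.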
There are several ways to overcome the above problem. One of the most common is Stochastic Gradient Descent, which is widely used to train Artificial Neural Networks (ANNs). In a neural network we usually have a large number of hidden layers, and each hidden node in a layer can learn a different pattern. This way training can visit many local minima and then try to settle on the most optimized one, although reaching the global minimum is not guaranteed.
In the above figure we can see the ANN converging at several local minima before finally trying to find the global minimum. The randomness introduced by this approach increases the probability of finding the global minimum.
Numerous gradient descent based optimization algorithms have been used to optimize neural networks. A few of the most common are listed below.
- Stochastic Gradient Descent: In Stochastic Gradient Descent, rather than computing a single weight update from the whole dataset, we update the weights after each row (or each small batch of rows) in the dataset. This produces many small, noisy updates, and that noise increases the probability of escaping a poor local minimum and finding the global minimum.
- Nonlinear Conjugate Gradient: This algorithm has been very successful in regression problems.
- L-BFGS: This algorithm is used in classification. It uses an approximation of the Hessian and requires the batch gradient.
- Levenberg-Marquardt Algorithm (LMA): This algorithm is one of the best for small datasets. Because of its complexity, its efficiency degrades as the size of the dataset increases.
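The per-row update used by Stochastic Gradient Descent can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sgd_linear_fit`, the toy data, the learning rate, and the number of epochs are all hypothetical, and the model is a simple line rather than a neural network. The loss here has a single minimum, so the sketch shows the update rule itself, not the escape from local minima.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.1, epochs=200, seed=0):
    # Fit y = w * x + b with stochastic gradient descent.
    # The weights are updated after every single row, not once per dataset.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the rows in a new random order each epoch
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b for this row.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Hypothetical toy data generated from y = 2x + 1, without noise.
xs = [i / 19 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_fit(xs, ys)
print("w =", round(w, 3), " b =", round(b, 3))
```

The shuffling step is where the randomness comes from: each epoch sees the rows in a different order, so the path the weights take differs from run to run unless the seed is fixed.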