Learning with Machines on large multidimensional datasets can be computationally intensive especially if significant batches of data samples are necessary or used for each learning step - as in batch gradient descent for perceptron training. However, if at each learning step, the learning is not differentiable, we end up with an iteration of next to similar steps on the gradient slope which do not significantly say a lot about the learning machine.
Let’s first understand what Gradient descent is, variations of it and how stochastic approximation can come into play. Using a simple explanation given in the Deep Learning Book(Ian Goodfellow et al);
Consider an objective function \(f(x)\) that we want to optimize i.e. minimize or maximise it by altering x.
We obtain its derivative \(f^1(x)\) which gives the slope of \(f(x)\) at point x, thus we can scale a change in input to obtain a corresponding change in output. For examples \(f(x +\in) \sim f(x) + \in f^1(x)\). Therefore, to minimize the function \(f(x)\), we can move small steps in the opposite direction of the derivative \(f^1(x)\). That is basically gradient descent.
The objective function and its derivative values represent crucial parts of the gradient learning ; \(f^1(x) = 0 \), which are referred to as critical or stationary points which provide no information about direction of gradient and these points fall within the following categories;
- Local minima- refers to a point lower than all neighboring points
- Local Maxima - refers to a point higher than all neighboring points
- Saddle points - Neither a maxima or minima
- Global Minima - is the lowest value of f(x) throughout the input space and it is highly possible for a local minima not to be globally optimum
Beyond and/or above the local maxima and minima it is not possible to increase and/or decrease the value of \(f(x)\).
Gradient Update rule
The general Gradient Update rule can be defined as follows;
\[W_{t+1} = w_t - \eta \bigtriangledown E\mid w_t\]In this case, we consider perceptron learning of weights(Haykin has a great explanation) over \(t\) learning steps and \(E\) is the cost function. Therefore the next weight \(w_{t+1}\) can be estimated by the current \(w_t\) minus a product of the learning rate \(\eta\) and a derivative of the cost function \(E\) w.r.t the weight.
The negative gradient gives the direction of the steepest descent in E.
Further we will explore the types of gradient descent; Batch which was mentioned earlier and online learning which processes each observation, one at a time, updating the gradient after learning step. Online learning has several advantages over the Batch learning i.e no need to store complete dataset, easy adaptability in cases of changing data and many more.
Learning gradient descent with stochastic approximation is basically online learning but the singular examples are selected randomly with equal probability at each learning update for \(tmax \cdot P\) learning steps(P is the number of observations in the dataset). Therefore the true gradient of the cost function \(E\) is approximated by a single example.
Below are my Matlab code samples that can be refactored into any programming langauge.
sgd.m
cost.m
The cost function and generalisation error values can be calculated after \(P\) randomised steps to monitor the learning performance. As seen below, we can see plateau states in the cost function after certain learning steps. At these points there isn’t much learning happening therefore the generalisation error will significantly increase. This behavior can also explain poor generalisation and overfitting at the plateau states.
Figure shows plateau states from Stochastic Gradient descent
Figure shows final weights from Stochastic Gradient descent. The weights signify the importance of the features
Further reading/ Reference material
-Deep learning by Ian Goodfellow, Yoshua Bengio, Aaron C. Courville
-The Elements of statistical learning by Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
-Neural networks and learning machines by Simon S. Haykin