What is GD?
Guided search from a random starting point
Optimization problem
What features of the error surface help us with GD?
1. Smooth, convex (bowl-shaped) surface
2. Single global minimum
Steps in GD
1. Starts by selecting a random point within weight space (i.e. each weight is assigned a random value within a sensible range).
2. Iteratively adjusts the weights, moving downhill along the error surface; each adjustment is scaled by the learning rate.
3. Stops when the convergence criterion is met (e.g. the error is no longer decreasing).
GD pseudo-code
Require: D (set of training instances)
Require: learning rate
Require: errorDelta function
Require: convergence criterion
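The requirements above can be sketched as a minimal batch GD loop. This is an illustrative implementation, assuming a simple linear model y = w0 + w1*x with squared error; the dataset, learning rate, and convergence threshold are made up for the example:

```python
import random

def batch_gd(D, lr=0.01, tol=1e-6, max_iters=10_000):
    """Batch gradient descent for a linear model y = w0 + w1*x.

    D: list of (x, y) training instances.
    Returns the learned weights (w0, w1).
    """
    # Start at a random point in weight space.
    w0 = random.uniform(-0.2, 0.2)
    w1 = random.uniform(-0.2, 0.2)
    prev_sse = float("inf")
    for _ in range(max_iters):
        # Accumulate errorDelta over the whole dataset (one batch update per pass).
        d0 = d1 = sse = 0.0
        for x, y in D:
            err = y - (w0 + w1 * x)   # prediction error for this instance
            d0 += err                 # gradient contribution for w0
            d1 += err * x             # gradient contribution for w1
            sse += err ** 2
        # Single weight adjustment per pass over D, scaled by the learning rate.
        w0 += lr * d0
        w1 += lr * d1
        # Convergence criterion: error has stopped improving.
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return w0, w1

# Usage: recover y = 1 + 2x from noiseless data.
D = [(x, 1 + 2 * x) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w0, w1 = batch_gd(D, lr=0.05)
```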
Batch GD
A single adjustment to each weight is made per pass over the training set, based on the summed prediction error across all training instances.
Stochastic GD
An adjustment to each weight is made based on the error in the prediction made by the candidate model for each individual training instance.
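The per-instance update can be sketched as follows; again a toy linear model with illustrative hyperparameters, not a definitive implementation:

```python
import random

def stochastic_gd(D, lr=0.01, epochs=200):
    """Stochastic GD: weights are adjusted after *each* training instance."""
    w0 = random.uniform(-0.2, 0.2)
    w1 = random.uniform(-0.2, 0.2)
    for _ in range(epochs):
        random.shuffle(D)              # visit instances in random order
        for x, y in D:
            err = y - (w0 + w1 * x)    # error for this single instance
            w0 += lr * err             # immediate per-instance adjustment
            w1 += lr * err * x
    return w0, w1

# Usage: same toy data as before, y = 1 + 2x.
w0, w1 = stochastic_gd([(x, 1 + 2 * x) for x in [0.0, 0.5, 1.0, 1.5, 2.0]],
                       lr=0.05, epochs=500)
```

Note that the weights move after every instance, rather than once per pass over D as in batch GD.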
Learning rate
Determines size of adjustments made to each weight at each step in the process
Small lr: will eventually converge to the global minimum, but can take a very long time
Large lr: causes jumps from one side of the error surface to the other, with a strong chance that the global minimum will be missed; the error may even increase rather than decrease, so the process may never converge
Initial random weights guidelines
[-0.2, 0.2]
Learning rate decay
Allows the learning rate to start at a large value and then decay over time according to a predefined schedule
lr[t] = lr[0] * (c / (c + t))
where t = current iteration of the GD algorithm
c = constant controlling how quickly the learning rate decays (often set to a fairly large value, e.g. 100)
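The schedule above in code, showing how the rate starts at lr[0] and halves by iteration t = c:

```python
def decayed_lr(lr0, t, c=100):
    """Learning-rate decay schedule: lr[t] = lr0 * c / (c + t)."""
    return lr0 * c / (c + t)

print(decayed_lr(0.5, 0))    # 0.5  (full rate at the start)
print(decayed_lr(0.5, 100))  # 0.25 (half the rate once t reaches c)
```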
Alternative to GD
An interesting alternative to gradient descent is population-based training algorithms such as evolutionary algorithms (EA) and particle swarm optimisation (PSO).
The basic idea behind population-based approaches is that a population of candidate solutions (NN weight vectors) is created; the candidate solutions iteratively explore the search space, exchanging information, and eventually converge on a minimum.
Because many starting points (candidate solutions) are used, the chances of converging on the global minimum are significantly increased. PSO and EA have been shown to perform very competitively, often (albeit not always) outperforming gradient descent on complex NN training problems.
Cons: prone to overfitting, because they aggressively optimise the training criterion
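A minimal PSO sketch illustrating the population idea: particles explore the space, each pulled toward its own best point and the swarm's best. The inertia/attraction constants and the toy quadratic objective are illustrative, not a tuned NN trainer:

```python
import random

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimise f over R^dim with a minimal particle swarm."""
    # Population of candidate solutions, started at random points.
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # personal best positions
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # swarm-wide best (shared info)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:                 # update personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:                # update global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Usage: minimise a simple quadratic "error surface".
best, best_val = pso(lambda p: sum(x * x for x in p), dim=2)
```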
MLP applications
NETtalk - producing phonemes from English text
Protein secondary structure prediction - label each amino acid as helix/coil/strand - ~64% accuracy
Handwritten digit recognition (LeCun et al., 1990)