How do you find the best set of weights (W’)
Define the cost function J(W’)
Apply the gradient descent algorithm (an iterative algorithm that finds the minimum of the cost function)
Define the cost function J(W’)
It is the sum of losses L(W’) for the misclassified samples
If a sample is misclassified as positive but is actually negative, its contribution is multiplied by -1 before being added to the sum, so every term added is positive
Is J(W’) always positive or negative
Positive, since each loss L(W’) is positive: the contribution of a misclassified negative sample is multiplied by -1, so every term in the sum is positive
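A minimal sketch of this cost function, assuming the perceptron criterion J(W’) = sum over misclassified samples of -y_i (W’ · x_i), with labels y_i in {+1, -1}; the function name and data are illustrative, not from the source:

```python
import numpy as np

def perceptron_cost(W, X, y):
    """Perceptron criterion: sum of -y_i * (W . x_i) over misclassified samples.

    X is (n_samples, n_features); y holds class labels in {+1, -1}.
    """
    scores = X @ W                   # signed score of each sample
    misclassified = y * scores <= 0  # sample is on the wrong side of the boundary
    # For a misclassified negative sample (y = -1), multiplying by -y flips the
    # sign, so each term in the sum is positive and J(W') is always >= 0.
    return float(np.sum(-y[misclassified] * scores[misclassified]))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, -1, -1])
W = np.array([0.5, 0.5])
print(perceptron_cost(W, X, y))  # only the second sample is misclassified
```

Note how correctly classified samples contribute nothing: only misclassified samples enter the sum, and each enters with a positive sign.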
Define L(W’)
This function measures the distance between a misclassified sample and the decision boundary
The first factor on the RHS is the class label, either +1 or -1, depending on the true sign of the misclassified sample
Describe gradient descent to minimise the cost function
The gradient of the cost function dJ/dw1 shows the slope of J at any point w1
If the gradient is positive we move left, towards the negative
If the gradient is negative, we move right towards the positive
We are trying to find the minimum of the cost function
What is the formula for the new set of weights using the gradient descent rule
W’(t+1) = W’(t) - learning rate * gradient of the cost function
The gradient of the cost function is a vector of partial derivatives that points in the direction of the steepest increase in cost
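The update rule can be sketched in a few lines; here a simple one-dimensional cost J(w) = w², whose gradient is 2w, stands in for the perceptron cost (the function name and the example cost are illustrative assumptions):

```python
def gradient_step(w, grad, learning_rate=0.1):
    """One gradient-descent update: w(t+1) = w(t) - learning_rate * grad."""
    # Subtracting the gradient moves against the direction of steepest
    # increase, i.e. toward lower cost.
    return w - learning_rate * grad

# Minimise J(w) = w^2 starting from w = 3.0; the gradient at w is 2w.
w = 3.0
for _ in range(50):
    w = gradient_step(w, 2 * w)
print(w)  # close to the minimum at w = 0
```

When w is positive the gradient 2w is positive, so the update moves left; when w is negative the update moves right, matching the rule above.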
What is the main effect of choosing a too large learning rate in gradient descent
The updates may overshoot the minimum and bounce back and forth
Advantage and disadvantage of having a small learning rate
It is more stable but converges more slowly
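Both effects can be seen on the same toy cost J(w) = w² (an illustrative choice, not from the source): a small learning rate creeps slowly toward the minimum, while a too-large one overshoots and bounces away from it.

```python
def run(learning_rate, steps=20, w=3.0):
    """Run gradient descent on J(w) = w^2 (gradient 2w) with a fixed rate."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

small = run(0.05)  # stable but slow: after 20 steps, still short of the minimum at 0
large = run(1.1)   # overshoots each step and bounces ever further from the minimum
print(small, large)
```

With rate 0.05 each step multiplies w by 0.9, so w shrinks steadily; with rate 1.1 each step multiplies w by -1.2, so w flips sign and grows, which is the overshoot-and-bounce behaviour described above.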
Formula for the new set of weights using gradient descent
Old weights - learning rate * gradient of the cost function