a) Why is SVM a discriminative algorithm?
b) What does a hyperplane mean?
a)
SVM is a discriminative algorithm because it directly learns the decision boundary (a separating hyperplane) between classes, rather than modelling how each class's data is generated
b)
hyperplane is a linear decision boundary that separates two classes in feature space
What is the SVM goal?
SVM chooses the best plane
Goal:
- among all separating hyperplanes between the classes, pick the one that maximizes its distance to the closest data points of each class
(the line must have maximal distance from the points in either class)
a) Explain the general equation of a hyperplane and margin
b) What does SVM try to maximize?
c) What do support vectors indicate?
d) What does a max-margin hyperplane mean?
a)
w·x + β = 0
- w is a weight vector (points perpendicular to the hyperplane)
- x is a point in feature space
- β is the bias (intercept) (shifts the hyperplane away from the origin along the normal direction)
special case (2 features)
{x1, x2}: β0 + β1x1 + β2x2 = 0
- straight line
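A minimal NumPy sketch of the hyperplane equation (the w and b values are made-up toy numbers): the sign of w·x + b tells you which side of the boundary a point falls on.

```python
import numpy as np

# Hypothetical 2D hyperplane: w = (1, -1), b = 0, i.e. the line x1 - x2 = 0.
w = np.array([1.0, -1.0])
b = 0.0

def side(x):
    """Signed value of w.x + b: positive on one side, negative on the other, 0 on the plane."""
    return np.dot(w, x) + b

assert side(np.array([2.0, 1.0])) > 0   # one side of the line
assert side(np.array([1.0, 2.0])) < 0   # the other side
assert side(np.array([3.0, 3.0])) == 0  # exactly on the hyperplane
```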
b)
SVM maximizes the margin to improve generalization on unseen data
- margin: distance between the hyperplane and the nearest data point from each class
c)
training instances that lie closest to the hyperplane
- ex. pixels at the boundary between ‘tumour’ and ‘normal’ tissue
d)
hyperplane w/ max margin or dist to the closest pts
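A small sketch of support vectors, assuming scikit-learn is available (the toy data is made up): after fitting, `support_vectors_` holds exactly the training points closest to the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Toy separable data along the line x2 = x1.
X = np.array([[0, 0], [1, 1], [3, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ hard margin

# Only the closest points, (1, 1) and (3, 3), define the boundary.
print(clf.support_vectors_)
```

The outer points (0, 0) and (4, 4) do not affect the boundary; removing them would leave the same max-margin hyperplane.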
Explain the following:
“Maximizing the margin of a hyperplane is equivalent to minimizing the ||W||”
||w|| = length (magnitude) of the weight vector
- with the usual scaling where the closest points satisfy yi(w·xi + b) = 1, the margin width is 2/||w||
- so making the perpendicular distance from the closest training points to the plane as large as possible is the same as making ||w|| as small as possible
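A quick numeric check of this equivalence (toy numbers): for a canonical hyperplane, where the closest points satisfy yi(w·xi + b) = 1, the margin width is 2/||w||, so a smaller ||w|| means a larger margin.

```python
import numpy as np

w = np.array([3.0, 4.0])              # hypothetical weight vector, ||w|| = 5
margin = 2.0 / np.linalg.norm(w)      # margin width = 2 / ||w|| = 0.4

# Halving ||w|| doubles the margin: maximizing margin <=> minimizing ||w||.
assert abs(margin - 0.4) < 1e-12
assert abs(2.0 / np.linalg.norm(0.5 * w) - 2 * margin) < 1e-12
```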
a) Why is SVM a dual optimization problem?
b) What are two optimization goals?
a)
SVM seeks to address 2 things:
1. ensure classification is correct
2. maximize the margin
b)
1. Minimize classification error
- minimizing the number of misclassified points –> finding the optimal location of the decision boundary
2. Maximize the margin
- maximizing the distance to the closest points of each class –> improving generalization
a) How do the SVM algorithm and its equation change to deal with non-separable cases?
b) What does the slack variable indicate?
a)
Separable cases:
yi(w·xi + b) ≥ 1
–> yi = label
–> xi = feature vector
–> w = weight (normal vector)
–> b = bias
correct classification:
yi(w·xi + b) ≥ 1
- (correct and outside the margin)
0 < yi(w·xi + b) < 1
- (correct but within the margin)
yi(w·xi + b) ≤ 0
- (incorrect and / or on the decision boundary)
non-separable :
yi(w·xi + b) ≥ 1 - εi, with εi ≥ 0 (one slack variable per point)
b)
slack variable (εi)
- allows some points to violate the margin (fall inside it) or even be misclassified; εi measures the size of the violation
- the algorithm then selects the hyperplane that minimizes the empirical error
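A sketch of the three classification cases and the slack value for a fixed hypothetical hyperplane (the standard soft-margin formulation sets εi = max(0, 1 − yi(w·xi + b)), the hinge loss):

```python
import numpy as np

w, b = np.array([1.0, 0.0]), 0.0             # hypothetical boundary: x1 = 0

points = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
labels = np.array([1, 1, 1])                 # all labelled +1

scores = labels * (points @ w + b)           # yi (w.xi + b)
slack = np.maximum(0.0, 1.0 - scores)        # eps_i = max(0, 1 - yi(w.xi + b))

assert slack[0] == 0.0   # correct and outside the margin (eps = 0)
assert slack[1] == 0.5   # correct but inside the margin (0 < eps < 1)
assert slack[2] == 2.0   # misclassified (eps > 1)
```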
a) What is the role of C in a generalized form of the SVM algo (to deal w/ non-separable cases)?
b) What are the trade-offs in choosing large/ small values of C?
c) How do we choose the best C?
a)
cost parameter that weights the penalty for:
- misclassification
- margin-violating training points
b)
large C
- strong penalty on errors
- makes ε small
Pro: low training error
Con: risk of overfitting and poor generalization
small C
- weak penalty on error
- tolerates larger ε
Pro: robustness to noise and low variance
Con: underfit if C is too small and model ignores useful structure
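One way to see this trade-off, assuming scikit-learn is available (the dataset and C values are arbitrary): a small C widens the margin and tolerates violations, so it typically keeps more support vectors than a large C on noisy data.

```python
from sklearn.datasets import make_classification  # assumes scikit-learn
from sklearn.svm import SVC

# Noisy toy data (flip_y adds label noise).
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

soft = SVC(kernel="linear", C=0.01).fit(X, y)    # weak penalty: wide margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)   # strong penalty: narrow margin

# The softer model should keep at least as many support vectors.
assert len(soft.support_) >= len(hard.support_)
```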
c)
N-fold cross-validation to find C (train with each candidate value of C, pick the one with the best average validation performance)
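A sketch of choosing C by N-fold cross-validation, assuming scikit-learn is available (the candidate grid and dataset are arbitrary):

```python
from sklearn.datasets import make_classification  # assumes scikit-learn
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},  # candidate values of C
    cv=5,                                       # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_["C"], grid.best_score_)
```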
In a soft margin SVM, what does the hyperparameter C control?
trade-off between maximizing the margin and minimizing the classification error