Overfitting and Regularization Flashcards

(14 cards)

1
Q

What is overfitting?

A
  • occurs when a model learns the training data too well
  • captures noise and random fluctuations, leading to poor generalization on new data
  • might result in the cost function being exactly zero on the training data
  • also called high variance
  • might happen with too many features
2
Q

What is underfitting?

A
  • a model that does not fit the training data well
  • also called high bias
  • unable to capture the underlying pattern in the training data
  • might happen with too few features
3
Q

What is an alternative term for overfitting?

A

high variance,
because overfit models can end up with highly variable predictions

4
Q

What is an alternative term for underfitting?

A

high bias

5
Q

What are approaches to deal with overfitting?

A
  • collecting more data
  • feature selection
  • regularization techniques
6
Q

Describe how collecting more data can help with overfitting

A
  • a larger training set makes the algorithm learn a function that is less wiggly, i.e. has lower variance
7
Q

Describe how feature selection can help with overfitting

A
  • select the features that are most relevant to the target / problem
  • a lot of features + not enough data can lead to overfitting
  • one way to select is to use intuition about what is relevant
  • but the information carried by the discarded features is thrown away
8
Q

Describe how regularization techniques can help with overfitting

A
  • eliminating a feature is equivalent to setting its parameter to 0
  • regularization instead encourages the model to keep all features but shrink their parameters to very small values, like 0.00001
  • lets you keep all your features
  • keeps any single feature from having an overly large effect
  • by convention only the parameters w_j are regularized, not b
  • regularizing b as well should make little difference
9
Q

How does regularization work?

A
  • the cost function is modified to include a penalty for large parameter values, e.g.:
  • min over w,b of (1/2m) * Sum over m of (f_wb(x^(i)) - y^(i))² + 1000·w_3² + 1000·w_4²
  • to minimize this cost function, the model has to choose very small values for w_3 and w_4
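The penalized cost above can be sketched in Python; the data, variable names, and penalty weight are illustrative, not from any particular course implementation:

```python
import numpy as np

# Toy sketch: a squared-error cost with extra penalties of
# 1000 * w3^2 and 1000 * w4^2 (indices 2 and 3) added on top.
def penalized_cost(w, b, X, y, penalty=1000.0):
    m = X.shape[0]
    preds = X @ w + b                         # f_wb(x^(i)) for every example
    fit = np.sum((preds - y) ** 2) / (2 * m)  # usual (1/2m) * sum of squares
    # the penalty makes any nonzero w_3, w_4 expensive to keep
    return fit + penalty * w[2] ** 2 + penalty * w[3] ** 2
```

Because the penalty dwarfs the fitting term, the minimizer is pushed toward w_3 ≈ w_4 ≈ 0.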
10
Q

What is the basic idea of regularization?

A
  • a simpler model is less likely to overfit, so smaller parameter values are valuable
  • typically all w_j parameters are penalized
11
Q

How is regularization usually implemented?

A

J(w,b) = (1/2m) * Sum over m of (f_wb(x^(i)) - y^(i))² + (lambda/2m) * Sum over n of w_j²

n - number of features
m - number of training examples
lambda - regularization parameter, > 0
Scaling both the fitting term and the penalty by 1/2m makes it easier to choose a lambda that keeps working as m changes
Lambda balances the goal of fitting the data against the goal of keeping the w_j small
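This cost function translates directly to code. A minimal sketch, assuming linear regression with a feature matrix X of shape (m, n), weights w of shape (n,), and scalar bias b (names are assumptions):

```python
import numpy as np

# Regularized cost J(w,b) for linear regression (sketch).
def regularized_cost(X, y, w, b, lam):
    m = X.shape[0]
    err = X @ w + b - y                       # f_wb(x^(i)) - y^(i)
    fit = np.sum(err ** 2) / (2 * m)          # data-fitting term
    reg = (lam / (2 * m)) * np.sum(w ** 2)    # penalty over all n weights; b excluded
    return fit + reg
```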

12
Q

What happens when lambda is 0 or extremely large?

A
  • lambda = 0 -> no regularization -> overfitting
  • lambda extremely large (e.g. 10¹⁰) -> all parameters w_j are driven close to 0 -> f(x) is roughly a horizontal line -> underfitting
13
Q

What changes with regularization for linear regression in terms of gradient descent?

A
  • the updates for w and b remain basically the same, except that the derivative dJ/dw_j changes, because J(w,b) now includes the regularization term (lambda/2m) * Sum over n of w_j²

b does not have to be regularized, so its update is unchanged

updated w_j = w_j - alpha * [ (1/m) * Sum over m of (f_wb(x^(i)) - y^(i)) · x_j^(i) + (lambda/m) · w_j ]
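One gradient-descent step with this extra term might be sketched as follows; alpha is the learning rate, lam the regularization parameter (variable names are assumptions):

```python
import numpy as np

# One regularized gradient-descent step for linear regression (sketch).
def gradient_step(X, y, w, b, alpha, lam):
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = X.T @ err / m + (lam / m) * w     # extra (lambda/m) * w_j term
    dj_db = np.sum(err) / m                   # b is not regularized
    return w - alpha * dj_dw, b - alpha * dj_db
```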

14
Q

What changes with regularization for logistic regression in terms of gradient descent and cost function?

A

the cost function of logistic regression gets the following term added:
+ (lambda/2m) * Sum over n of w_j²

the gradient-descent derivative for w_j gets added:
+ (lambda/m) * w_j
-> similar to regularized linear regression, but f(x) is a different function (the sigmoid)

b will not be regularized
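In code the gradients have the same structure as in regularized linear regression, with f(x) swapped for the sigmoid. A sketch (function and variable names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Regularized gradients for logistic regression (sketch).
def logistic_gradients(X, y, w, b, lam):
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y              # f(x) is now sigmoid(w·x + b)
    dj_dw = X.T @ err / m + (lam / m) * w     # extra (lambda/m) * w_j term
    dj_db = np.sum(err) / m                   # b is not regularized
    return dj_dw, dj_db
```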
