Initialization and regularization Flashcards

(17 cards)

1
Q

why is the way we initialize weights important?

A

because of the problem of exploding or vanishing gradients. During backpropagation the gradient at an early layer is a product of per-layer factors: if those factors are larger than 1, the gradient is amplified exponentially with depth (exploding gradients)

if they are under 1 the problem is mirrored: the gradients shrink toward zero (vanishing gradients) and learning becomes difficult

we want to keep the gradient magnitudes consistent across the network

when the gradients explode there is still a learning signal, but the magnitudes are so large that it is difficult to distinguish the contributions of the different paths, and training becomes unstable
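
A toy calculation (my own illustration, not part of the card) shows how quickly repeated per-layer factors compound:

```python
# Toy illustration: the gradient reaching an early layer is roughly a
# product of per-layer factors. Factors slightly above 1 explode
# exponentially with depth; factors slightly below 1 vanish.
depth = 50
exploding = 1.1 ** depth  # each layer scales the gradient up by 1.1x
vanishing = 0.9 ** depth  # each layer scales the gradient down by 0.9x

print(f"{depth} layers at 1.1x per layer: {exploding:.1f}")   # ~117.4
print(f"{depth} layers at 0.9x per layer: {vanishing:.5f}")   # ~0.00515
```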

2
Q

what is the goal of initialization?

A

the general goal is to initialize the weights such that the variance of the activations is the same across every layer. This constant variance helps prevent the gradients from exploding or vanishing
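
A small sketch (my own, with made-up layer sizes) compares a naive init against a fan-in-scaled one by pushing one random input through a deep linear stack:

```python
import random
import statistics

random.seed(0)
fan_in, depth = 256, 10  # made-up layer width and depth

def forward_variance(weight_std):
    """Variance of the activations after `depth` linear layers."""
    x = [random.gauss(0, 1) for _ in range(fan_in)]
    for _ in range(depth):
        w = [[random.gauss(0, weight_std) for _ in range(fan_in)]
             for _ in range(fan_in)]
        x = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return statistics.pvariance(x)

print(forward_variance(1.0))             # explodes (~256**10 scale)
print(forward_variance(fan_in ** -0.5))  # stays near 1 across all layers
```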

3
Q

how can we derive initialization values that avoid exploding or vanishing gradients?

A

we can make the following simplifying assumptions so that each layer ends up with a similar variance:
1. weights and inputs are centered at zero
2. weights and inputs are independent and identically distributed
3. biases are initialized to zero

the inputs enter these assumptions too, so they have to be normalized
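
Under those assumptions, the variance of a pre-activation z_j = Σ_i w_ji x_i factors cleanly (a standard derivation, sketched here for reference):

```latex
\operatorname{Var}(z_j)
  = \operatorname{Var}\!\Big(\sum_{i=1}^{n_{\mathrm{in}}} w_{ji}\, x_i\Big)
  = \sum_{i=1}^{n_{\mathrm{in}}} \operatorname{Var}(w_{ji}\, x_i)
  = n_{\mathrm{in}}\, \operatorname{Var}(w)\, \operatorname{Var}(x)
```

so choosing Var(w) proportional to 1/n_in keeps Var(z) at the same scale as Var(x), layer after layer.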

4
Q

what are two well-known initializations?

A

He and Xavier initialization

5
Q

what is Xavier initialization?

A

designed for the tanh activation function
uses the assumption that tanh is approximately linear for small inputs
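
The standard Glorot/Xavier uniform formula (layer sizes below are made up for illustration) draws weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)):

```python
import math

fan_in, fan_out = 256, 128  # made-up layer sizes

# Xavier/Glorot uniform: sample weights from U(-a, a) with
# a = sqrt(6 / (fan_in + fan_out)).
a = math.sqrt(6.0 / (fan_in + fan_out))

# The variance of U(-a, a) is a^2 / 3, which works out to exactly
# 2 / (fan_in + fan_out) -- the variance Xavier targets so that
# near-linear tanh units keep activations at a constant scale.
variance = a * a / 3.0
print(variance, 2.0 / (fan_in + fan_out))  # both ~0.00521
```

In PyTorch this corresponds to `torch.nn.init.xavier_uniform_`.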

6
Q

what is He initialization?

A

designed for the ReLU activation function
also known as Kaiming initialization
used by default by PyTorch for linear layers
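
The standard He/Kaiming-normal recipe (the fan-in below is made up) samples weights from N(0, sqrt(2 / fan_in)); the extra factor of 2 compensates for ReLU zeroing out roughly half of the activations:

```python
import math
import random
import statistics

random.seed(0)
fan_in = 512  # made-up fan-in

# He/Kaiming normal: weights ~ N(0, sqrt(2 / fan_in)).
std = math.sqrt(2.0 / fan_in)
w = [random.gauss(0, std) for _ in range(100_000)]

# The empirical variance lands close to the target 2 / fan_in.
print(statistics.pvariance(w), 2.0 / fan_in)
```

In PyTorch this corresponds to `torch.nn.init.kaiming_normal_`.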

7
Q

what is overfitting

A

when the model has high variance
also when train performance is significantly better than test performance
the validation loss starts increasing after a certain point

8
Q

what is underfitting

A

high bias
train and test loss nearly identical
loss is high with no sign of decreasing

9
Q

how to combat overfitting

A

increase the amount of data (more batches per epoch, so more updates per epoch)
utilize data augmentation methods
decrease model size/complexity (fewer hidden layers)
implement early stopping (stop at the point where validation loss was lowest)
utilize regularization techniques
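
Early stopping from the list above can be sketched with a "patience" counter (the names and numbers below are my own):

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss, stopping the
    scan once the loss has failed to improve for `patience` epochs."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving: stop training
    return best_epoch

# Validation loss falls, then rises (overfitting sets in after epoch 3):
losses = [1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]
print(early_stopping_epoch(losses))  # -> 3
```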

10
Q

how to combat underfitting

A

increase model size and complexity
train for a longer period of time
reduce regularization techniques

11
Q

What is regularization, in simple terms?

A

Regularization = “make overfitting harder.” It’s any trick that nudges a model to generalize better rather than memorize the training set.

12
Q

what are L1 and L2 regularization?

A

both add a penalty on the weights to the loss, which changes the learned weights
L1 generally pushes many weights to 0 (sparser weights)
L2 penalizes larger weights more heavily and doesn't force many weights to 0
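
As a sketch (the formulas are the standard ones; the lambda value and weights are made up):

```python
# L1 adds lam * sum(|w|) to the loss; L2 adds lam * sum(w^2).
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

w = [3.0, -1.0, 0.1]
print(l1_penalty(w, 0.01))  # 0.01 * (3 + 1 + 0.1)  = 0.041
print(l2_penalty(w, 0.01))  # 0.01 * (9 + 1 + 0.01) = 0.1001

# Note the gradients: the L1 term pushes every weight toward 0 with a
# constant-size step (hence sparsity), while the L2 term's push is
# proportional to w, so it shrinks large weights hard but rarely makes
# them exactly zero.
```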

13
Q

what is the other name of L2 regularization

A

weight decay
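
A one-step sketch (all numbers made up) of why the name fits: with plain SGD, the gradient of the L2 term turns each update into a constant-factor shrink of the weight plus the usual gradient step:

```python
# The L2 term lam * w^2 contributes gradient 2 * lam * w, so
#   w <- w - lr * (grad + 2*lam*w)  =  w * (1 - 2*lr*lam) - lr * grad,
# i.e. each step first *decays* the weight by a constant factor.
lr, lam, w, grad = 0.1, 0.05, 2.0, 0.3

step_with_l2 = w - lr * (grad + 2 * lam * w)
step_decay = w * (1 - 2 * lr * lam) - lr * grad
print(step_with_l2, step_decay)  # identical: 1.95
```

This equivalence is why PyTorch optimizers expose L2 regularization through a `weight_decay` parameter.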

14
Q

how does dropout work

A

with some specified probability p, drop any given neuron of the network during a training iteration
do this for each neuron independently
this prevents single neurons from becoming overly important and helps equalize the capacity/contribution of each neuron
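
A sketch of "inverted" dropout, the common implementation (function name and values are my own): survivors are scaled by 1/(1-p) so the expected activation is unchanged, and at test time the function is the identity:

```python
import random

def dropout(x, p, train=True):
    """Zero each activation with probability p during training,
    scaling survivors by 1/(1-p); identity at test time."""
    if not train or p == 0.0:
        return list(x)
    return [0.0 if random.random() < p else xi / (1.0 - p) for xi in x]

random.seed(0)
x = [1.0, 2.0, 3.0, 4.0]
print(dropout(x, p=0.5))               # each entry is either 0.0 or doubled
print(dropout(x, p=0.5, train=False))  # [1.0, 2.0, 3.0, 4.0]
```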

15
Q

at which stage do we apply dropout?

A

only during training, not during testing
during testing all neurons are turned on and dropout is disabled

16
Q

what is the main benefit of dropout?

A

it mainly helps with overfitting by making the model generalize better. It doesn't improve the model's capacity; rather, it focuses on the generalizability of the model