Hyperparameter tuning, batch normalization and programming frameworks Flashcards

(29 cards)

1
Q

what are the hyperparameters we need to tune?

A

alpha (the learning rate)
beta, if we use momentum
the Adam parameters (beta1, beta2, epsilon)
number of layers
number of hidden units per layer
learning rate decay
mini-batch size

2
Q

what are the most important hyperparameters to tune?

A

the most important is alpha (the learning rate)

next in importance are:
momentum beta (0.9 is a good default)
mini-batch size
number of hidden units

the rest are less important

3
Q

what are the usual values for momentum and for Adam?

A

for momentum, beta = 0.9
for Adam: beta1 = 0.9, beta2 = 0.999 and epsilon = 10^-8

4
Q

how do we choose the best values for the hyperparameters

A

usually the best approach is to try random values, because beforehand it is difficult to know which values will work best
sampling at random (rather than on a grid) lets us try many more distinct values of each hyperparameter, giving a richer exploration

5
Q

how should we proceed with sampling the potential values of the hyperparameters

A

coarse to fine: sample randomly over the whole space, then zoom in on the region where the values worked best and keep sampling more densely there

6
Q

how do we choose the scale in which we’re going to sample numbers to find the values of the hyperparameters

A

we need to think about the appropriate scale for each hyperparameter, and then use a distribution that samples uniformly on that scale.

For example, to sample alpha:
r = -4 * np.random.rand() # this gives a value between -4 and 0
Then we can say that alpha will be:
alpha = 10 ** r
So now alpha is between 10^-4 and 1, with every order of magnitude equally likely
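The recipe above, as a runnable NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# sample the exponent uniformly: r lands in [-4, 0)
r = -4 * rng.random()
# alpha then lands between 10**-4 and 1, with each decade
# (1e-4..1e-3, 1e-3..1e-2, ...) equally likely
alpha = 10 ** r
assert 1e-4 <= alpha <= 1
```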

7
Q

why doesn't it make sense to sample on a linear scale, and why is a logarithmic scale better when tuning hyperparameters?

A

on a linear scale most samples land in the same order of magnitude, but for hyperparameters like the learning rate it is the order of magnitude that matters most when training our models, so we prefer to turn the values into exponents and sample those exponents uniformly instead
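A classic case of this idea is sampling the momentum beta between 0.9 and 0.999: moving from 0.999 to 0.9995 doubles the averaging window 1/(1-beta), while moving from 0.900 to 0.905 barely matters, so we sample 1 - beta on a log scale. A sketch (the range endpoints are the usual course values, the names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# sample r uniformly in (-3, -1], so 1 - beta is between 10**-3 and 10**-1
r = -2 * rng.random() - 1
beta = 1 - 10 ** r
# beta now lies between 0.9 and 0.999, with the sensitive region
# near 1 explored as thoroughly as the region near 0.9
assert 0.9 <= beta <= 0.999
```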

8
Q

what are the main ways in which people tune the hyperparameters

A

1) babysitting one model: we watch the model learn and, based on how it improves, adjust hyperparameters such as the learning rate as we go. Every day we check what's happening and nudge it, patiently watching the performance. This is the approach when we don't have enough computational power to train several models at once.

2) training many models in parallel: we launch several models with different hyperparameters at the same time, compare their performance curves, and at the end pick the one with the best performance.

9
Q

what are the names of the two approaches to tuning hyperparameters?

A

panda: babysitting one model, making sure that one is successful

caviar: many models in parallel, not paying much attention to any one of them; some of them will do well

10
Q

what does batch normalization do?

A

for a given layer, it takes the mean and the standard deviation of Z over the mini-batch and normalizes the values of Z with them
with this, the neural network's computations use normalized values of Z
we already know how to normalize the values of the input layer; here we are normalizing the values of Z in the hidden layers too

the normalized Z is then rescaled using two learned parameters, gamma and beta, which set its variance and its mean
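A minimal NumPy sketch of the computation described above (function and variable names are my own, and the epsilon is the usual guard against dividing by zero):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z (shape (n_units, m)) over the mini-batch axis,
    then rescale with the learned parameters gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)       # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)       # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta          # mean beta, variance gamma**2
    return Z_tilde

Z = np.random.randn(3, 5) * 10 + 7           # 3 units, mini-batch of 5
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
out = batch_norm_forward(Z, gamma, beta)
# with gamma=1, beta=0 each unit's outputs have mean ~0 and variance ~1
```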

11
Q

why batch normalization helps

A

because during learning, each layer's input distribution keeps changing in scale and shift, which makes learning slow: the next layer has to chase a moving target instead of seeing stable inputs
batch normalization stabilizes the mean and variance of each layer's inputs, making learning more efficient

12
Q

in which step inside a hidden layer does the batch normalization happen

A

it happens after getting z: we obtain Z, normalize it, rescale it using gamma and beta, and the resulting Z-tilde is what we pass to the activation function in the same layer; then we compute a for that layer

13
Q

when we do batch normalization what are the new parameters we’re adding to a neural network

A

Basically we're adding the parameters beta and gamma, one pair per layer
this beta is different from the beta in momentum and in the Adam algorithm

14
Q

under which condition do we usually apply batch normalization?

A

we usually do it when using mini-batches
we take one mini-batch, compute the mean and standard deviation of Z and normalize with them; then we take the next mini-batch and do it again

15
Q

which parameter can we eliminate when doing batch normalization?

A

we can take b out of the computation of Z, as any constant added to Z gets cancelled out in the mean subtraction; its role is taken over by beta
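A quick numeric check of why b drops out: adding a constant to every example of a unit shifts the mean by the same constant, so it vanishes when we subtract the mean (names are illustrative):

```python
import numpy as np

z = np.random.randn(1, 8)        # one unit, mini-batch of 8
b = 3.7                          # any constant bias

def normalize(z):
    mu = z.mean(axis=1, keepdims=True)
    return (z - mu) / z.std(axis=1, keepdims=True)

# the bias is removed by the mean subtraction,
# so z and z + b normalize to exactly the same values
assert np.allclose(normalize(z), normalize(z + b))
```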

16
Q

what are the dimensions of beta and gamma in batch normalization for each layer

A

there is one value per hidden unit, so for layer l each of beta and gamma has shape (n[l], 1)

17
Q

when we use batch normalization, what are the parameters we need to update in the gradient descent?

A

we have to update the usual W (b is not needed, as it cancels out)
we also need to compute the gradients for the new parameters beta and gamma and update them in the same way

18
Q

why does batch normalization work?

A

we know it helps in the input layer by normalizing the values of all input features, but it also helps in the hidden layers:
1) it makes the weights deeper in the network more robust to changes in the earlier layers: because each layer's Z is kept around a fixed mean and variance, it limits how much changes in earlier layers can shift the distribution seen by later layers
2) it also has a slight regularization effect: the mean and variance are computed on the current mini-batch, so they add a bit of noise to the values of Z, which is similar in spirit to dropout

19
Q

what do we have to do at test time if we did batch normalization?

A

at test time we may need to predict on just ONE example, and we cannot take the mean and variance of a single example.
We need a separate estimate of the mean and variance
we get it during training with an exponentially weighted average across mini-batches: for each layer we keep a running estimate of mu (the mean) and of the variance
at test time, we normalize Z using these running estimates of the whole training set's mean and variance, and then rescale with the learned gamma and beta as usual
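A sketch of how the running estimates could be maintained during training (the decay value 0.9 and all names are illustrative; scalars stand in for the per-unit vectors):

```python
import numpy as np

momentum = 0.9                       # decay of the exponentially weighted average
running_mu, running_var = 0.0, 1.0   # per-layer estimates, updated every batch

for _ in range(1000):                # loop over mini-batches during training
    Z = np.random.randn(64) * 2 + 5  # stand-in batch of Z values for one unit
    running_mu = momentum * running_mu + (1 - momentum) * Z.mean()
    running_var = momentum * running_var + (1 - momentum) * Z.var()

# at test time a single example is normalized with the running estimates,
# then rescaled with the learned gamma and beta as usual
z_test, gamma, beta = 6.0, 1.5, 0.2
z_norm = (z_test - running_mu) / np.sqrt(running_var + 1e-8)
z_tilde = gamma * z_norm + beta
```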

20
Q

where does the softmax name come from

A

in contrast to a "hard max", which would put a 1 on the output unit with the highest value and 0 on all the others, softmax gives some (soft) probability to every output

20
Q

what is the correct notation for a mini batch

A

we use curly braces as a superscript on the matrix name
for example
X^{1} is mini-batch one

21
Q

what can we do if we need the output layer to tell us the probability of different classes

A

we can label each class with a number, for example classes 0, 1, 2, 3; then the output layer can be a softmax layer with one unit per class

21
Q

how does a softmax layer work

A

we basically take the vector Z in the output layer and transform it into probabilities that sum to 1
for this we compute t = e^Z element-wise, and the activation divides each element by the sum of t across the vector: a_i = e^(z_i) / sum_j e^(z_j)

we use the softmax activation function in that layer to make all of that happen
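The computation above in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick not mentioned in the card, and it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())   # t = e^z, shifted by max(z) for stability
    return t / t.sum()        # each entry divided by the sum, so output sums to 1

z = np.array([5.0, 2.0, -1.0, 3.0])   # raw output-layer values
a = softmax(z)
assert np.isclose(a.sum(), 1.0)       # a valid probability distribution
```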

22
Q

what is the loss function when we have a softmax classifier

A

the loss is the negative sum over classes of y_j * log(yhat_j)

for each example we know which class it belongs to (y is one-hot), so only the term for the true class survives: minimizing the loss means maximizing the predicted probability of the class that should have been 1
conceptually the loss is computed differently but plays the same role as usual: we want to reduce the loss and maximize learning, only with softmax we do it in terms of predicted probabilities
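The loss for one example, written out (names are illustrative):

```python
import numpy as np

def softmax_loss(y_hat, y):
    """Cross-entropy loss: -sum_j y_j * log(y_hat_j).
    With one-hot y, only the true class's term survives."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0, 0])               # one-hot: the true class is class 1
y_hat = np.array([0.1, 0.7, 0.1, 0.1])   # softmax output of the network
loss = softmax_loss(y_hat, y)            # reduces to -log(0.7)
```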

23
Q

what are some deep learning frameworks?

A

Caffe, Caffe2, CNTK, DL4J, Keras, Lasagne, MXNet, PaddlePaddle, TensorFlow, Theano, Torch

24
Q

how do we choose a deep learning framework?

A

ease of programming (development and deployment)
running speed
truly open (open source with good governance)

25
Q

in the normalization formula in batch normalization, why does the Z-norm formula have an epsilon inside the square root in the denominator?

A

to avoid division by zero when the variance is very small

26
Q

true or false: when using batch normalization, we introduce 2 new parameters, gamma and beta, that must be learned or trained

A

TRUE

27
Q

true or false: the parameters gamma and beta in batch normalization set the variance and mean of Z-norm

A

TRUE: gamma sets the variance and beta sets the mean