Hyperparameter tuning, batch normalization and programming frameworks Flashcards

(29 cards)

1
Q

what are the hyperparameters we need to tune?

A

alpha (the learning rate)
beta, if we use momentum
the Adam parameters (beta1, beta2, epsilon)
number of layers
number of hidden units per layer
learning rate decay
mini-batch size

2
Q

what are the most important hyperparameters to tune?

A

the most important is alpha (the learning rate)

next in importance are:
momentum beta (0.9 is a good default)
mini-batch size
number of hidden units

the rest are less important

3
Q

what are the usual values for momentum and for Adam?

A

for momentum, beta = 0.9
for Adam: beta1 = 0.9, beta2 = 0.999 and epsilon = 10^-8

4
Q

how do we choose the best values for the hyperparameters

A

usually the best approach is to try random values, because beforehand it is difficult to know which values will work best
sampling at random (rather than on a grid) lets us try many more distinct values of each hyperparameter, giving a richer exploration

5
Q

how should we proceed with sampling the potential values of the hyperparameters

A

coarse to fine: sample randomly over the whole space, then zoom in on the region where the values worked best and keep sampling more densely there

6
Q

how do we choose the scale in which we’re going to sample numbers to find the values of the hyperparameters

A

we need to think about the appropriate scale for each hyperparameter, and then use a distribution that samples uniformly on that scale.

For example, to sample alpha:
r = -4 * np.random.rand() # this gives a value between -4 and 0
Then we can say that alpha will be:
alpha = 10 ** r
So now alpha is between 10^-4 and 1, with every order of magnitude equally likely
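The recipe above, as a runnable NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# sample the exponent uniformly: r lands in [-4, 0)
r = -4 * rng.random()
# alpha then lands between 10**-4 and 1, with each decade
# (1e-4..1e-3, 1e-3..1e-2, ...) equally likely
alpha = 10 ** r
assert 1e-4 <= alpha <= 1
```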

7
Q

why doesn't it make sense to sample on a linear scale, and why is a logarithmic scale better when tuning hyperparameters?

A

on a linear scale most samples land in the same order of magnitude, but for hyperparameters like the learning rate it is the order of magnitude that matters most when training our models, so we prefer to turn the values into exponents and sample those exponents uniformly instead
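A classic case of this idea is sampling the momentum beta between 0.9 and 0.999: moving from 0.999 to 0.9995 doubles the averaging window 1/(1-beta), while moving from 0.900 to 0.905 barely matters, so we sample 1 - beta on a log scale. A sketch (the range endpoints are the usual course values, the names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# sample r uniformly in (-3, -1], so 1 - beta is between 10**-3 and 10**-1
r = -2 * rng.random() - 1
beta = 1 - 10 ** r
# beta now lies between 0.9 and 0.999, with the sensitive region
# near 1 explored as thoroughly as the region near 0.9
assert 0.9 <= beta <= 0.999
```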

8
Q

what are the main ways in which people tune the hyperparameters

A

1) babysitting one model: we watch the model learn and, based on how it improves, adjust hyperparameters such as the learning rate as we go. Every day we check what's happening and nudge it, patiently watching the performance. This is the approach when we don't have enough computational power to train several models at once.

2) training many models in parallel: we launch several models with different hyperparameters at the same time, compare their performance curves, and at the end pick the one with the best performance.

9
Q

what are the names of the two approaches to tuning hyperparameters?

A

panda: babysitting one model, making sure that one is successful

caviar: many models in parallel, not paying much attention to any one of them; some of them will do well

10
Q

what does batch normalization do?

A

for a given layer, it takes the mean and the standard deviation of Z over the mini-batch and normalizes the values of Z with them
with this, the neural network's computations use normalized values of Z
we already know how to normalize the values of the input layer; here we are normalizing the values of Z in the hidden layers too

the normalized Z is then rescaled using two learned parameters, gamma and beta, which set its variance and its mean
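A minimal NumPy sketch of the computation described above (function and variable names are my own, and the epsilon is the usual guard against dividing by zero):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize Z (shape (n_units, m)) over the mini-batch axis,
    then rescale with the learned parameters gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)       # per-unit mean over the batch
    var = Z.var(axis=1, keepdims=True)       # per-unit variance over the batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta          # mean beta, variance gamma**2
    return Z_tilde

Z = np.random.randn(3, 5) * 10 + 7           # 3 units, mini-batch of 5
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
out = batch_norm_forward(Z, gamma, beta)
# with gamma=1, beta=0 each unit's outputs have mean ~0 and variance ~1
```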

11
Q

why batch normalization helps

A

because during learning, each layer's input distribution keeps changing in scale and shift, which makes learning slow: the next layer has to chase a moving target instead of seeing stable inputs
batch normalization stabilizes the mean and variance of each layer's inputs, making learning more efficient

12
Q

in which step inside a hidden layer does the batch normalization happen

A

it happens after getting z: we obtain Z, normalize it, rescale it using gamma and beta, and the resulting Z-tilde is what we pass to the activation function in the same layer; then we compute a for that layer

13
Q

when we do batch normalization what are the new parameters we’re adding to a neural network

A

Basically we're adding the parameters beta and gamma, one pair per layer
this beta is different from the beta in momentum and in the Adam algorithm

14
Q

under which condition do we usually apply batch normalization?

A

we usually do it when using mini-batches
we take one mini-batch, compute the mean and standard deviation of Z and normalize with them; then we take the next mini-batch and do it again

15
Q

which parameter can we eliminate when doing batch normalization?

A

we can take b out of the computation of Z, as any constant added to Z gets cancelled out in the mean subtraction; its role is taken over by beta
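A quick numeric check of why b drops out: adding a constant to every example of a unit shifts the mean by the same constant, so it vanishes when we subtract the mean (names are illustrative):

```python
import numpy as np

z = np.random.randn(1, 8)        # one unit, mini-batch of 8
b = 3.7                          # any constant bias

def normalize(z):
    mu = z.mean(axis=1, keepdims=True)
    return (z - mu) / z.std(axis=1, keepdims=True)

# the bias is removed by the mean subtraction,
# so z and z + b normalize to exactly the same values
assert np.allclose(normalize(z), normalize(z + b))
```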

16
Q

what are the dimensions of beta and gamma in batch normalization for each layer

A

there is one value per hidden unit, so for layer l each of beta and gamma has shape (n[l], 1)

17
Q

when we use batch normalization, what are the parameters we need to update in the gradient descent?

A

we have to update the usual W (b is not needed, as it cancels out)
we also need to compute the gradients for the new parameters beta and gamma and update them in the same way

18
Q

why does batch normalization work?

A

we know it helps in the input layer by normalizing the values of all input features, but it also helps in the hidden layers:
1) it makes the weights deeper in the network more robust to changes in the earlier layers: because each layer's Z is kept around a fixed mean and variance, it limits how much changes in earlier layers can shift the distribution seen by later layers
2) it also has a slight regularization effect: the mean and variance are computed on the current mini-batch, so they add a bit of noise to the values of Z, which is similar in spirit to dropout

19
Q

what do we have to do at test time if we did batch normalization?

A

at test time we may need to predict on just ONE example, and we cannot take the mean and variance of a single example.
We need a separate estimate of the mean and variance
we get it during training with an exponentially weighted average across mini-batches: for each layer we keep a running estimate of mu (the mean) and of the variance
at test time, we normalize Z using these running estimates of the whole training set's mean and variance, and then rescale with the learned gamma and beta as usual
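A sketch of how the running estimates could be maintained during training (the decay value 0.9 and all names are illustrative; scalars stand in for the per-unit vectors):

```python
import numpy as np

momentum = 0.9                       # decay of the exponentially weighted average
running_mu, running_var = 0.0, 1.0   # per-layer estimates, updated every batch

for _ in range(1000):                # loop over mini-batches during training
    Z = np.random.randn(64) * 2 + 5  # stand-in batch of Z values for one unit
    running_mu = momentum * running_mu + (1 - momentum) * Z.mean()
    running_var = momentum * running_var + (1 - momentum) * Z.var()

# at test time a single example is normalized with the running estimates,
# then rescaled with the learned gamma and beta as usual
z_test, gamma, beta = 6.0, 1.5, 0.2
z_norm = (z_test - running_mu) / np.sqrt(running_var + 1e-8)
z_tilde = gamma * z_norm + beta
```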

20
Q

where does the softmax name come from

A

in contrast to a "hard max", which would put a 1 on the output unit with the highest value and 0 on all the others, softmax gives some (soft) probability to every output

20
Q

what is the correct notation for a mini batch

A

we use curly braces as a superscript on the matrix name
for example
X^{1} is mini-batch one

21
Q

what can we do if we need the output layer to tell us the probability of different classes

A

we can label each class with a number, for example classes 0, 1, 2, 3; then the output layer can be a softmax layer with one unit per class

21
Q

how does a softmax layer work

A

we basically take the vector Z in the output layer and transform it into probabilities that sum to 1
for this we compute t = e^Z element-wise, and the activation divides each element by the sum of t across the vector: a_i = e^(z_i) / sum_j e^(z_j)

we use the softmax activation function in that layer to make all of that happen
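The computation above in NumPy; subtracting the max before exponentiating is a standard numerical-stability trick not mentioned in the card, and it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    t = np.exp(z - z.max())   # t = e^z, shifted by max(z) for stability
    return t / t.sum()        # each entry divided by the sum, so output sums to 1

z = np.array([5.0, 2.0, -1.0, 3.0])   # raw output-layer values
a = softmax(z)
assert np.isclose(a.sum(), 1.0)       # a valid probability distribution
```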

22
Q

what is the loss function when we have a softmax classifier

A

the loss is the negative sum over classes of y_j * log(yhat_j)

for each example we know which class it belongs to (y is one-hot), so only the term for the true class survives: minimizing the loss means maximizing the predicted probability of the class that should have been 1
conceptually the loss is computed differently but plays the same role as usual: we want to reduce the loss and maximize learning, only with softmax we do it in terms of predicted probabilities
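The loss for one example, written out (names are illustrative):

```python
import numpy as np

def softmax_loss(y_hat, y):
    """Cross-entropy loss: -sum_j y_j * log(y_hat_j).
    With one-hot y, only the true class's term survives."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0, 1, 0, 0])               # one-hot: the true class is class 1
y_hat = np.array([0.1, 0.7, 0.1, 0.1])   # softmax output of the network
loss = softmax_loss(y_hat, y)            # reduces to -log(0.7)
```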

23
Q

what are some deep learning frameworks?

A

Caffe, Caffe2, CNTK, DL4J, Keras, Lasagne, MXNet, PaddlePaddle, TensorFlow, Theano, Torch

24
Q

how do we choose a deep learning framework?

A

ease of programming (development and deployment)
running speed
truly open (open source with good governance)

25
Q

in the normalization formula in batch normalization, why does the Z-norm formula have an epsilon inside the square root in the denominator?

A

to avoid division by zero when the variance is very small

26
Q

true or false: when using batch normalization, we introduce 2 new parameters, gamma and beta, that must be learned or trained

A

TRUE

27
Q

true or false: the parameters gamma and beta in batch normalization set the variance and mean of Z-norm

A

TRUE: gamma sets the variance and beta sets the mean