Which optimizations can you do prior to training?
Input normalization and weight initialization (e.g. Xavier/Glorot).
Which optimizations can you do during training?
Dropout, batch normalization, and a variable learning rate (annealing).
Which optimizations can you do when computing the loss?
Weighted examples, focal loss, triplet loss, or combining multiple loss functions.
How can we optimize the training procedure (while searching for the best solution)?
By using a variable learning rate (annealing).
What is input normalization?
A prior-to-training optimization: rescale the inputs (e.g. to zero mean and unit variance) so that all features are on a comparable scale.
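A minimal sketch of input normalization (the function name and NumPy usage are my own):

```python
import numpy as np

def normalize_inputs(X, eps=1e-8):
    """Standardize each feature to zero mean and unit variance,
    using statistics computed over the training set."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps), mu, sigma
```

The same mu/sigma from the training set must be reused on validation and test data.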
To what problem is the Xavier/Glorot initialization a solution?
When initializing the weights of the network, the common practice was to initialize randomly from a normal distribution.
The problem: the variance of a unit's pre-activation, var(z), grows with the number of inputs n, which can saturate the activations.
What was the Xavier/Glorot solution?
Make the weights smaller so that var(w_i) = 1/n, which keeps var(z) ≈ 1.
Therefore:
weight_i = weight_i × sqrt(1/n)
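The scaling above can be sketched as (helper name is my own; assuming inputs with unit variance):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Draw weights from N(0, 1/n_in), i.e. the standard normal draw
    scaled by sqrt(1/n_in), so that var(z) = var(sum_i w_i x_i)
    stays near 1 for unit-variance inputs."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
```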
What problem does using dropout tackle?
It decreases the dependence on any single feature: by randomly dropping units during training, the network cannot co-adapt to specific features.
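A sketch of (inverted) dropout, one common variant (function name and the rescaling convention are my own choices):

```python
import numpy as np

def dropout(a, p_drop=0.5, rng=None, train=True):
    """Inverted dropout: randomly zero activations during training and
    rescale the survivors so the expected activation is unchanged.
    At test time the activations pass through untouched."""
    if not train:
        return a
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)
```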
What is batch normalization?
It is a during-training optimization technique.
–> normalize internal activations using dataset statistics
–> with stochastic optimization, use batch-level statistics
What problem does batch normalization tackle?
During training, weight updates at a later layer should take into account changes at earlier layers (internal covariate shift)
-> updates introduce changes in the distribution of internal activations
-> this requires careful initialization and a small learning rate
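A sketch of the training-mode forward pass of batch normalization (names are my own; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature with batch statistics, then apply the
    learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```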
What are the benefits of using batch normalization?
Internal activations keep a stable distribution, so training becomes less sensitive to initialization and tolerates a larger learning rate; the noise from batch statistics also has a slight regularizing effect.
What could be a potential problem/weakness with the gradient descent as how we have seen it so far? And how do we tackle it?
If 80% of the examples belong to one class, the model mainly learns the important features of that class, because the weight updates are dominated by the majority class.
Tackle this by weighting the examples, e.g. giving minority-class examples a larger weight in the loss.
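One way to weight examples, sketched with binary cross-entropy (the function name and the per-class weighting scheme are my own illustration):

```python
import numpy as np

def weighted_bce(p, y, class_weights):
    """Binary cross-entropy where each example is weighted by its class,
    so minority-class examples contribute more per example.
    class_weights maps label (0 or 1) to a weight."""
    w = np.where(y == 1, class_weights[1], class_weights[0])
    eps = 1e-12
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.mean(w * losses)
```

Choosing weights inversely proportional to class frequency is a common heuristic.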
What does Focal Loss do?
It down-weights the loss of well-classified (easy) examples, so that training focuses on the hard ones.
What problem does focal loss tackle?
When the dataset is balanced (e.g. 50-50) but one class has features that are harder to learn (more detail, finer distinctions).
Where could focal loss be useful?
How does focal loss work?
It adds an extra factor that increases the relative loss for examples that are harder to classify, thereby forcing the model to train on those examples.
It is based on the probability the model assigns to the correct label: the higher that probability, the more the loss is down-weighted. For very uncertain examples the loss stays high, so the model is pushed toward them.
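This behaviour follows the usual focal form FL = -(1 - p_t)^γ · log(p_t), sketched below (the function name is my own):

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class.
    The (1 - p_t)^gamma factor shrinks toward 0 for confident (easy)
    examples, down-weighting them; with gamma = 0 this reduces to
    plain cross-entropy."""
    eps = 1e-12
    return -((1.0 - p_t) ** gamma) * np.log(p_t + eps)
```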
What is Triplet Loss?
How is triplet loss different?
With a normal loss we compare the prediction to the ground truth (the original label).
With triplet loss, we use three examples (anchor, positive, negative) and compare distances.
The anchor and positive share the same class and should be close; the negative belongs to a different class and should be farther away by at least a margin.
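A sketch of the standard margin-based triplet loss on embedding vectors (squared Euclidean distance; names are my own):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the positive (same class) and push it away
    from the negative (different class) by at least `margin`.
    The loss is zero once the negative is sufficiently far."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)
```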
What is the idea behind using multiple loss functions?
Combine complementary objectives, e.g. for object localization:
–> add a localization loss to regularize a high-performing classifier so that it also learns to localize
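Combining losses usually amounts to a weighted sum; a toy sketch (names and the trade-off weight are my own illustration):

```python
def combined_loss(cls_loss, loc_loss, lam=0.5):
    """Total objective = classification loss + lam * localization loss.
    lam trades off how strongly localization regularizes the classifier."""
    return cls_loss + lam * loc_loss
```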
Why would we opt to use a variable learning rate? (Annealing)
As training progresses, the steps taken might be too large to reach the optimum (when using a fixed learning rate); annealing shrinks the learning rate over time.
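One common annealing scheme is step decay, sketched here (function name and decay constants are my own choices):

```python
def step_decay_lr(lr0, step, decay=0.5, every=10):
    """Step-decay schedule: multiply the initial learning rate lr0 by
    `decay` once every `every` steps, so updates shrink as training
    approaches the optimum."""
    return lr0 * (decay ** (step // every))
```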
In self-supervised learning, we have the problem that data annotation is expensive. What could be a solution to this?
Supervise using labels generated from the data itself (pretext tasks), without manual annotation.