why is the way we initialize the weights important
because of the problem of exploding or vanishing gradients. if the per-layer gradient factors are larger than 1, the gradients get amplified as they propagate backward (exploding)
if they are smaller than 1, the gradients shrink toward zero (vanishing) and learning becomes difficult
we want to keep the gradients at a consistent magnitude across the network
when the gradients explode there is still a learning signal, but the magnitudes are so large that it is difficult to differentiate the contributions of the different paths
what is the goal of an initialization?
the general goal is to initialize the weights such that the variance of the activations is the same across every layer. this constant variance helps prevent the gradients from exploding or vanishing
how can we help derive our initializations values to avoid exploding or vanishing gradients?
we can simplify with the following assumptions so that each layer has a similar variance
1. weights and inputs centered at zero
2. weights and inputs are independent and identically distributed
3. biases are initialized as zeros
inputs also factor into these assumptions, so they have to be normalized
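a minimal monte-carlo sketch of why the variance-preserving rule works: under the assumptions above (zero-mean, i.i.d. weights and inputs, zero biases), one linear unit sums fan_in products, so its output variance is fan_in * Var(w) * Var(x). picking Var(w) = 1/fan_in keeps the output variance close to the input variance, while Var(w) = 1 blows it up by a factor of fan_in. the fan_in and sample counts here are arbitrary:

```python
import math
import random

random.seed(0)
fan_in = 256
n_samples = 2000

def layer_output_variance(w_std):
    # simulate one pre-activation: dot product of zero-mean unit-variance
    # inputs with zero-mean weights of standard deviation w_std
    outs = []
    for _ in range(n_samples):
        x = [random.gauss(0, 1) for _ in range(fan_in)]
        w = [random.gauss(0, w_std) for _ in range(fan_in)]
        outs.append(sum(wi * xi for wi, xi in zip(w, x)))
    mean = sum(outs) / len(outs)
    return sum((o - mean) ** 2 for o in outs) / len(outs)

var_good = layer_output_variance(1 / math.sqrt(fan_in))  # Var(w) = 1/fan_in
var_big = layer_output_variance(1.0)                     # Var(w) = 1
```

var_good stays near 1 (the input variance), while var_big is near fan_in = 256, which is what compounds into exploding activations over many layers.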
what are 2 well-known initialization schemes
he and xavier initialization
what is the xavier initialization
designed for the tanh activation function
utilizes the assumption that tanh is approximately linear for small inputs
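a sketch of the xavier (glorot) uniform rule in plain python: the target variance is 2 / (fan_in + fan_out), and since a uniform distribution on [-a, a] has variance a^2 / 3, the bound is a = sqrt(6 / (fan_in + fan_out)). the fan sizes and sample count below are arbitrary:

```python
import math
import random

random.seed(0)

def xavier_uniform(fan_in, fan_out, n):
    # Xavier/Glorot: Var(w) = 2 / (fan_in + fan_out), realized as
    # a uniform draw on [-a, a] with a = sqrt(6 / (fan_in + fan_out))
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [random.uniform(-a, a) for _ in range(n)]

w = xavier_uniform(256, 256, 10000)
empirical_var = sum(wi * wi for wi in w) / len(w)  # weights are zero-mean
```

the empirical variance comes out close to 2 / (256 + 256), the value the derivation (under the near-linear tanh assumption) asks for.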
what is the he initialization
designed for the ReLU activation function
also known as kaiming initialization
used by default by PyTorch for linear layers
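a sketch of the he (kaiming) normal rule: because ReLU zeroes out roughly half of the activations, the weight variance is doubled to 2 / fan_in to compensate. the fan_in and sample count are arbitrary:

```python
import math
import random

random.seed(0)

def he_normal(fan_in, n):
    # He/Kaiming: Var(w) = 2 / fan_in; the extra factor of 2 makes up
    # for ReLU killing about half of the pre-activations
    std = math.sqrt(2.0 / fan_in)
    return [random.gauss(0, std) for _ in range(n)]

w = he_normal(fan_in=512, n=10000)
empirical_var = sum(wi * wi for wi in w) / len(w)  # weights are zero-mean
```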
what is overfitting
when we have high variance
also when train performance is significantly better than test performance
validation loss increasing after a certain point
what is underfitting
high bias
train and test loss nearly identical
high loss, no signs of decreasing
how to combat overfitting
increase the amount of data (more examples per epoch means more update steps per epoch and better coverage of the input distribution)
utilize data augmentation methods
decrease model size/complexity (fewer hidden layers)
implement early stopping (stop training at the point where validation loss was lowest)
utilize regularization techniques
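the early-stopping idea in the list above can be sketched in plain python; the validation-loss curve and patience value here are made up for illustration (the curve dips and then rises again, the typical overfitting signature):

```python
# synthetic validation losses: decreasing, then rising as overfitting sets in
val_losses = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.45, 0.52, 0.60]

def early_stop(losses, patience=2):
    # track the best loss seen; stop once it hasn't improved for
    # `patience` consecutive epochs and report the best epoch
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best

stop_epoch, stop_loss = early_stop(val_losses)
```

on this curve training halts shortly after epoch 4, where the validation loss bottomed out at 0.40, instead of continuing into the region where it climbs back up.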
how to combat underfitting
increase model size and complexity
train for a longer period of time
reduce regularization techniques
what is regularization, in simple terms?
Regularization = “make overfitting harder.” It’s any trick that nudges a model to generalize better rather than memorize the training set.
what is L1 and L2 regularization
they add a penalty on the weights to the loss, which shrinks the weights during training
L1 generally pushes many values to 0 (sparser weights)
L2 generally more heavily impacts larger values and doesn’t force many values to 0
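a minimal sketch of the two penalties in plain python (the weights and the strength lam are illustrative). the comments note why they behave differently: the L1 gradient has constant magnitude, so small weights get pushed all the way to exactly zero, while the L2 gradient is proportional to the weight, so big weights shrink hardest but rarely hit zero:

```python
def l1_penalty(weights, lam):
    # L1: lam * sum(|w|); gradient lam * sign(w) -> sparsity
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # L2: lam * sum(w^2); gradient 2 * lam * w -> shrinks large weights most
    return lam * sum(w * w for w in weights)

weights = [3.0, -0.5, 0.0, 1.5]
p1 = l1_penalty(weights, 0.01)  # 0.01 * (3 + 0.5 + 0 + 1.5) = 0.05
p2 = l2_penalty(weights, 0.01)  # 0.01 * (9 + 0.25 + 0 + 2.25) = 0.115
```

in practice the chosen penalty is simply added to the training loss before backpropagation.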
what is the other name of L2 regularization
weight decay
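a one-line sketch of why L2 is called weight decay: folding the gradient of the L2 term into an SGD step multiplicatively decays each weight toward zero. the lr and wd values below are illustrative:

```python
def sgd_step(w, grad, lr=0.1, wd=0.01):
    # gradient of (loss + (wd/2) * w^2) is grad + wd * w,
    # so each step shrinks ("decays") the weight toward 0
    return w - lr * (grad + wd * w)

# with a zero loss gradient, the weight just decays:
w_new = sgd_step(w=2.0, grad=0.0)  # 2.0 - 0.1 * (0.01 * 2.0) = 1.998
```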
how does dropout work
with some specified p, drop any given neuron of the neural network with probability p during a training iteration
do this for each neuron independently
avoids single neurons becoming overly important and helps equalize capacity/power of each neuron
in which stage of development of the neural network we apply dropout
only during training, not during testing
during testing all neurons are turned on and dropout is disabled
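the train/test behavior above can be sketched as inverted dropout in plain python: at train time each unit is dropped independently with probability p and the survivors are scaled by 1/(1-p) so the expected activation is unchanged, which is why nothing needs rescaling at test time. the layer size and p are illustrative:

```python
import random

random.seed(0)

def dropout(activations, p, training):
    # inverted dropout: drop each unit independently with probability p
    # during training and scale survivors by 1/(1-p); at test time the
    # function is the identity (all neurons stay on)
    if not training:
        return list(activations)
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

acts = [1.0] * 10000
train_out = dropout(acts, p=0.5, training=True)
test_out = dropout(acts, p=0.5, training=False)

kept = sum(1 for a in train_out if a != 0.0)
mean_train = sum(train_out) / len(train_out)
```

roughly half the units survive training, the surviving ones carry value 2.0 instead of 1.0, and the mean activation stays near 1.0 in both modes.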
what is the main benefit of dropout
it mainly helps with overfitting by making the model generalize better. it doesn't improve the model's capacity; rather, it focuses on the generalizability of the model