Loss surface geometries difficult for optimization
tanh function
parameter sharing
regularizes parameters by forcing sets of parameters to be equal (e.g., a convolutional kernel reuses the same weights at every spatial location)
Normalization is helpful as it can make the loss surface better conditioned, which speeds up and stabilizes optimization
Color jitter
Data Augmentation
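Color jitter as an augmentation can be sketched as follows; the function name and the scaling range are illustrative, not from the card:

```python
import numpy as np

def color_jitter(img, rng, strength=0.2):
    """Minimal color-jitter sketch: randomly scale each color channel.

    img: float array of shape (H, W, 3) with values in [0, 1].
    """
    # One random gain per channel, e.g. in [0.8, 1.2] for strength=0.2.
    gains = rng.uniform(1 - strength, 1 + strength, size=3)
    return np.clip(img * gains, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))
aug = color_jitter(img, rng)
```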
The key principle for NN training
Sanity checks for learning after optimization
L2 regularization results in a solution that is ___ sparse than L1
L2 regularization results in a solution that is less sparse than L1
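The sparsity difference can be seen from the two penalties' shrinkage (proximal) steps; a minimal numpy sketch, with an illustrative weight vector:

```python
import numpy as np

w = np.array([0.05, -0.3, 0.8, -0.02, 1.5])
lam = 0.1

# L2 (ridge) proximal step: uniform shrinkage -- no weight reaches exactly 0.
w_l2 = w / (1 + lam)

# L1 (lasso) proximal step: soft-thresholding -- weights below lam are zeroed.
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```

The L1 step drives the two small weights exactly to zero, while the L2 step only scales everything down.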
Why is initialization of parameters important
What suggests overfitting when looking at validation/training curve?
Validation loss starts to increase while training loss keeps decreasing
Shared Weights
sigmoid function
ReLU
Sigmoid is typically avoided unless ___
you want to clamp values to [0, 1] (i.e., logistic regression)
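A minimal sketch of the sigmoid squashing any real input into (0, 1):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
y = sigmoid(x)
```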
Simpler Xavier initialization (Xavier2)
N(0, 1) * sqrt(1 / n_j)
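The simpler Xavier rule, sketched in numpy; here n_j is taken to be the layer's fan-in (an assumption consistent with the usual Xavier formulation):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Simpler Xavier: scale N(0, 1) draws by sqrt(1 / n_in),
    # where n_in (n_j in the card) is the layer's fan-in.
    return rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)

rng = np.random.default_rng(0)
W = xavier_init(512, 256, rng)
```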
How to Prevent Co-Adapted Features
Fully Connected Neural Network
learns more and more abstract features from the raw input;
not well-suited for images (ignores spatial structure, and the parameter count explodes)
Why does dropout work
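One way to see dropout in code: a minimal inverted-dropout sketch (names and the keep-scaling convention are illustrative):

```python
import numpy as np

def dropout(x, p, rng, train=True):
    # Inverted dropout: zero each unit with probability p at train time and
    # rescale survivors by 1/(1-p) so the expected activation is unchanged.
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000,))
y = dropout(x, p=0.5, rng=rng)
```

Because each unit must work with many random subsets of the others, units cannot co-adapt to specific partners.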
Pooling Layer
Layer to explicitly down-sample image or feature maps (dimensionality reduction)
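Down-sampling via 2x2 max pooling, as a minimal numpy sketch:

```python
import numpy as np

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2: keep the largest value in each
    # non-overlapping 2x2 window, halving height and width.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
y = max_pool_2x2(x)
```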
What is the number of parameters for a CNN layer with N kernels of size k1 × k2 × … × kn and 3 input channels?
N × (k1 · k2 · … · kn · 3 + 1)
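The count (one weight per kernel position per input channel, plus one bias per kernel) can be checked in a few lines; the function name is illustrative:

```python
def conv_params(n_kernels, kernel_dims, in_channels=3):
    # N x (k1 * k2 * ... * kn * C + 1): each kernel has one weight per
    # position per input channel, plus one bias.
    weights_per_kernel = 1
    for k in kernel_dims:
        weights_per_kernel *= k
    return n_kernels * (weights_per_kernel * in_channels + 1)

# e.g. 64 kernels of size 3x3 over an RGB input:
n = conv_params(64, (3, 3))
```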
L2 regularization
Sigmoid Function Key facts
Definition of accuracy with respect to TP, TN, FP, FN
(TP + TN) / (TP + TN + FP + FN)
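As a quick check (the counts below are illustrative):

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

acc = accuracy(tp=40, tn=50, fp=5, fn=5)
```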