goal: fit the training set well on the cost function. if it doesn't, what to do?
bigger network, better optimizer (e.g. Adam)
fits the training set well but not the dev set?
regularization, bigger training set, early stopping
fits the dev set well but not the test set?
bigger dev set
performs well on the test set but not in the real world?
change dev set or cost function
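
The checklist above can be read as a chain: check each stage in order and only tune the knobs for the first stage that fails. A minimal sketch of that reading follows. The function name `next_knob`, the error inputs, and the `target_err` and `gap` thresholds are all hypothetical, picked for illustration; the notes give no numbers.

```python
# Hypothetical helper: names and thresholds are illustrative assumptions,
# not values taken from the notes.
def next_knob(train_err, dev_err, test_err, real_world_ok,
              target_err=0.05, gap=0.02):
    """Return the first failing stage and the knobs the notes list for it."""
    # Stage 1: training error is still above the target.
    if train_err > target_err:
        return "train", ["bigger network", "better optimizer (e.g. Adam)"]
    # Stage 2: dev error is much worse than training error.
    if dev_err - train_err > gap:
        return "dev", ["regularization", "bigger training set", "early stopping"]
    # Stage 3: test error is much worse than dev error.
    if test_err - dev_err > gap:
        return "test", ["bigger dev set"]
    # Stage 4: the metrics look fine but the deployed system does not.
    if not real_world_ok:
        return "real world", ["change dev set", "change cost function"]
    return "ok", []
```

For example, `next_knob(0.03, 0.09, 0.10, True)` reports the `"dev"` stage, since training error is under the assumed target but the train-to-dev gap exceeds the assumed threshold.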