Name a function that models the all-or-nothing response of biological neurons.
Threshold
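A minimal sketch of such a threshold (Heaviside step) activation; the function name and the threshold value theta are illustrative:

```python
def threshold(x, theta=0.0):
    # All-or-nothing response: fire (1) if the input reaches the
    # threshold theta, otherwise stay silent (0).
    return 1.0 if x >= theta else 0.0

print(threshold(0.7))   # 1.0 -> neuron fires
print(threshold(-0.3))  # 0.0 -> neuron stays silent
```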
Why is the signum function not used in deep learning?
It is not differentiable at x=0, and its derivative is 0 everywhere else.
=> gradient is 0 or undefined → no learning would occur, i.e. the weight updates would be 0 → the network would never improve its performance on the training data.
What is Dropout?
A regularization technique: during training, each neuron's output is randomly set to zero with some probability (e.g. 0.5). This prevents co-adaptation of neurons and reduces overfitting. At test time all neurons are active and the outputs are scaled accordingly.
What is the objective function of Rosenblatt’s perceptron?
Find weights that minimize the distance of misclassified samples to the decision boundary
Classification is based on the sign of the distance.
Why is it useful to learn a bias term in training?
The bias offsets the weighted sum, ensuring a non-zero output even when all inputs are zero.
–> It provides flexibility by shifting the activation function left or right, so the decision boundary is not forced through the origin, improving the model's ability to fit the data and make more precise predictions.
What is the task of Softmax function as the last layer in a neural network for a classification task?
produces a probability distribution over the classes for each input
By:
- rescaling the outputs so that they sum up to 1
- producing non-negative outputs (via exponentiation)
= it is the normalized exponential function
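The normalized exponential can be sketched in a few lines of pure Python; the max-subtraction is a standard numerical-stability trick that does not change the result:

```python
import math

def softmax(logits):
    # Subtract the max so exp() never overflows; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]   # non-negative, sums to 1

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # a probability distribution over 3 classes
print(sum(probs))  # 1.0 (up to floating-point error)
```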
What are the advantages of making a neural network deeper?
Deeper networks can represent complex functions with far fewer parameters than equally expressive shallow ones: each layer composes and reuses the features of the previous layer, building increasingly abstract representations (e.g. edges → parts → objects).
What does backpropagation do?
computes all gradients required for the optimization of the network.
What is the exploding gradient problem?
The updates in earlier layers can be increasingly large.
During backpropagation the gradient is a product of many per-layer terms; when these terms are large (e.g. large weights), the gradients grow exponentially toward the earlier layers. A too-high learning rate amplifies this positive feedback –> the loss grows without bound.
What is the vanishing gradient problem?
The updates in earlier layers can be negligibly small.
During backpropagation the gradient is a product of many per-layer terms; when these terms are small (e.g. saturated sigmoid derivatives, which are at most 0.25), the gradients shrink exponentially toward the earlier layers –> the gradient vanishes and the early weights barely change.
What is the standard loss function for classification?
Cross-entropy loss: assumes that the outputs can be interpreted as probabilities that the input belongs to each class
Specifically, it assumes that the data follow a Bernoulli (for binary classification) or Multinoulli/Categorical (for multi-class classification) distribution.
What is the standard loss function for regression?
L2-loss: assumes that the residuals (i.e. differences between the true and predicted values) follow a Gaussian distribution.
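Both standard losses can be sketched in a few lines (function names here are illustrative):

```python
import math

def cross_entropy(probs, target_index):
    # Negative log-likelihood of the true class under the predicted
    # probability distribution; 0 when the true class gets probability 1.
    return -math.log(probs[target_index])

def l2_loss(y_true, y_pred):
    # Mean squared error over the residuals.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(cross_entropy([0.7, 0.2, 0.1], 0))  # small: confident, correct
print(l2_loss([1.0, 2.0], [1.5, 2.5]))    # average squared residual
```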
What is Batch Gradient Descent (BGD)?
Computes the gradient of the loss over the entire training set before making a single parameter update. The updates are accurate but each one is expensive for large datasets.
What is Stochastic (Online) Gradient Descent?
Updates the parameters after computing the gradient on a single training sample at a time. The updates are cheap and the noise can help escape shallow local minima, but the training is unstable.
What is Mini-Batch SGD?
A compromise: updates the parameters using the gradient averaged over a small batch of samples (e.g. 32–256), combining the stability of BGD with the efficiency of SGD and exploiting parallel hardware.
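The three regimes differ only in how many samples feed each update; a hypothetical `minibatches` helper makes the spectrum explicit (batch_size = len(data) gives BGD, batch_size = 1 gives online SGD, anything in between gives mini-batch SGD):

```python
def minibatches(data, batch_size):
    # Yield successive slices of the dataset; one gradient step per slice.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

print(list(minibatches(list(range(10)), 4)))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```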
What are the steps to train a neural network using the backpropagation algorithm and an optimizer like Stochastic Gradient Descent (SGD)?
- randomly initialize the weights and biases
- forward the input through the network and get the output
- compute the loss between the prediction and the ground truth
- backpropagate and tune the weights and biases of each neuron to minimize the loss
- iterate until the loss converges and the weights are optimized
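The steps above can be sketched as a minimal SGD loop fitting a single weight w to toy data generated with w_true = 2.0 (a one-parameter model, not a full network; all values are illustrative):

```python
import random

random.seed(0)
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # targets y = 2x

w = random.random()           # 1. random initialization
lr = 0.1
for epoch in range(100):      # 5. iterate
    for x, y in data:
        y_hat = w * x         # 2. forward pass
        err = y_hat - y       # 3. loss: (err)^2
        grad = 2 * err * x    # backprop: d(err^2)/dw
        w -= lr * grad        # 4. update the weight

print(round(w, 3))  # converges to 2.0
```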
What is the idea of momentum-based learning?
Idea: Accelerate in directions with persistent gradients
- Parameter update based on current and past gradients, i.e. use previous gradient directions to accelerate training and become more robust against local minima.
What is the purpose of the Momentum used in different optimizers?
It stabilizes the training by computing the moving average over the previous gradients.
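A sketch of the classic momentum update (hyperparameter values are illustrative): the velocity is a moving average of past gradients, so under a persistent gradient it grows toward grad / (1 - beta), i.e. steps up to 10x larger than plain SGD for beta = 0.9:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + grad   # moving average of past gradients (velocity)
    w = w - lr * v        # the update uses the velocity, not the raw gradient
    return w, v

# Persistent gradient of +1.0: the velocity approaches 1 / (1 - 0.9) = 10.
w, v = 0.0, 0.0
for _ in range(50):
    w, v = momentum_step(w, v, 1.0)
print(round(v, 3))  # close to 10
```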
What is the zero-centering problem?
The sigmoid activation function produces outputs that are not zero-centered (always positive), so all weight gradients in the next layer share the same sign, causing inefficient zig-zag updates.
–> covariate shift of successive layers
How to solve the zero-centering problem?
Batch normalization which standardizes the inputs to each layer to have zero mean and unit variance, reducing the amount of internal covariate shift.
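A sketch of the core batch-normalization computation over one feature (the learnable scale gamma and shift beta are omitted for brevity):

```python
import math

def batch_norm(xs, eps=1e-5):
    # Standardize a batch of activations to zero mean and unit variance;
    # eps guards against division by zero for constant batches.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(round(sum(out), 6))  # ~0: zero mean after normalization
```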
What is the dying ReLUs problem?
If a ReLU neuron's pre-activation becomes negative for all training inputs, its output is 0 and its gradient is 0, so its weights are never updated again: the unit is permanently "dead". This is often caused by a large learning rate or large negative bias updates; leaky ReLU mitigates it by keeping a small slope for negative inputs.
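A minimal numeric illustration, assuming the standard ReLU and leaky-ReLU derivatives:

```python
def relu_grad(x):
    # Gradient of ReLU: 0 for negative inputs -> no update can flow through.
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    # Leaky ReLU keeps a small slope alpha, so the unit can still learn.
    return 1.0 if x > 0 else alpha

print(relu_grad(-2.0))        # 0.0  -> dead unit, weights frozen
print(leaky_relu_grad(-2.0))  # 0.01 -> gradient still flows
```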
What are the disadvantages of the powerful neural network with fully connected layers that motivate CNN?
- The number of parameters grows rapidly with input size (a 224×224×3 image means ~150k weights per neuron in the first layer) –> high memory cost and overfitting
- The spatial structure of the input is ignored: pixels are treated as unordered features
- The same pattern at a different position must be relearned from scratch
What are the advantages of CNN in comparison with fully connected neural networks?
-Local connectivity
-Weight sharing
-translation invariance (recognizes patterns irrespective of their position in the input)
-exploits the grid-like structure of images
What decides the choice of function to apply to the output of CNN for the classification problems?
The number of classes and whether they are mutually exclusive: softmax for multi-class single-label classification, sigmoid for binary or multi-label classification.