What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.
Learning by gradient descent means iteratively adjusting the model's weights in the direction that most steeply decreases the error E. The error E measures how far the network's outputs are from the target outputs; its gradient with respect to each weight tells us how changing that weight affects E, so each step moves a small distance downhill: w ← w − η ∂E/∂w.
Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.
Backpropagation is an efficient application of the chain rule that computes the gradient of the error E with respect to every weight in the network, by propagating error signals backwards from the output layer. Gradient descent then uses these gradients to update the weights.
What are the common problems of gradient descent that may limit its effectiveness?
It can get stuck in local minima or slow down on plateaus; convergence can be slow; and it is sensitive to the choice of learning rate (too small means slow learning, too large means oscillation or divergence).
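The update rule above can be sketched in a few lines. This is a minimal illustration on a hand-picked 1-D error E(w) = (w − 3)², not a full training loop; the learning rate and step count are illustrative assumptions.

```python
# Minimal sketch of gradient descent on a 1-D error E(w) = (w - 3)^2.
# The gradient dE/dw = 2*(w - 3) points uphill, so we step in the
# opposite direction, scaled by the learning rate lr.
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                      # arbitrary starting point
    for _ in range(steps):
        grad = 2 * (w - 3)       # dE/dw at the current w
        w -= lr * grad           # move downhill
    return w

print(gradient_descent())        # converges towards the minimum at w = 3
```

With a larger learning rate the iterates would overshoot and oscillate around the minimum, which is one of the problems listed above.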
Explain the role of activation functions in a NN
They play a crucial role by introducing non-linearities into the model, which are essential for enabling the network to learn complex patterns in the data. Without them, a stack of layers would collapse into a single linear transformation.
What is the purpose of the cost function in a NN
Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values
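A common concrete instance of such a cost is the mean squared error; this small sketch (an illustrative choice, not the only option) shows how it quantifies the mismatch between predictions and targets.

```python
# Mean squared error: average of the squared differences between
# predicted values and the corresponding correct (target) values.
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

print(mse([0.9, 0.2, 0.8], [1.0, 0.0, 1.0]))  # small mismatches -> small cost
```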
Explain the role of bias terms in a NN
A bias term shifts a neuron's activation threshold: it lets the neuron produce a non-zero output even when all its inputs are zero, so the decision boundary does not have to pass through the origin.
What is a perceptron
An artificial neuron that takes in several input signals and produces a single binary output signal (0 or 1) by thresholding a weighted sum of its inputs.
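A minimal sketch of that thresholding. The weights and bias here are chosen by hand to implement logical AND (an illustrative assumption, not learned weights):

```python
# Perceptron: weighted sum of inputs plus a bias, thresholded to 0 or 1.
def perceptron(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0

# Hand-chosen parameters implementing logical AND.
AND = lambda x1, x2: perceptron([x1, x2], weights=[1, 1], bias=-1.5)
print([AND(0, 0), AND(0, 1), AND(1, 0), AND(1, 1)])  # [0, 0, 0, 1]
```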
Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent
In BGD, the model parameters are updated in one step per pass, based on the average gradient over the entire training dataset. In SGD, an update is made for each individual training example (or for each small mini-batch, in the mini-batch variant).
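The two update schemes can be contrasted on a tiny 1-D least-squares problem (the data, learning rate, and epoch count below are illustrative assumptions):

```python
import random

# Fit y = w*x to noiseless data generated with w = 2, using the
# per-example squared error e = (w*x - y)^2, whose gradient is 2*(w*x - y)*x.
data = [(x, 2 * x) for x in [1.0, 2.0, 3.0, 4.0]]

def batch_gd(w=0.0, lr=0.01, epochs=200):
    for _ in range(epochs):
        # One update per pass, using the average gradient over all examples.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def sgd(w=0.0, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):  # one update per example
            w -= lr * 2 * (w * x - y) * x
    return w
```

Both reach w ≈ 2 here, but SGD makes four updates per pass to BGD's one, which is why it scales better to large datasets.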
Which gradient descent is preferred for large datasets and why.
Stochastic GD is preferred over Batch GD: each update needs only one example (or a small mini-batch) rather than a full pass over the dataset, so it is far cheaper per update and starts making progress immediately; the noise in its updates can also help it escape shallow local minima.
Define generalisation
The ability of a trained model to perform well on unseen data
How can you measure the generalisation ability of an MLP
By evaluating its error on held-out data that was not used for training, e.g. a separate validation or test set.
How can you decide on an optimal number of hidden units?
By trial and error: train networks with different numbers of hidden units and compare their errors on a validation set (or via cross-validation), picking the size with the lowest validation error.
Explain the difference between two common activation functions of your choice
Sigmoid vs TanH
1. Output Range:
- Sigmoid: (0,1): used for binary classification
- tanh: (-1, 1): suitable for zero-centred data
2. Symmetry:
- Sigmoid is asymmetric, biased towards positive values
- tanh is symmetric around the origin (0, 0)
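The two differences above can be checked numerically; the sample points below are arbitrary:

```python
import math

# Evaluate both activation functions at a few points to show their
# output ranges and (a)symmetry around zero.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in [-5.0, 0.0, 5.0]:
    print(f"z={z:+}: sigmoid={sigmoid(z):.4f}, tanh={math.tanh(z):.4f}")
```

Note that sigmoid(0) = 0.5 (outputs are biased towards positive values), while tanh(0) = 0 and tanh(−z) = −tanh(z).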
What are the problems with squared error as the loss function, give two alternatives
With a sigmoid (or softmax) output unit, squared error gives an almost-zero gradient when the unit saturates at the wrong answer, so learning stalls; it also corresponds to a Gaussian-noise assumption on the targets, which is a poor fit for classification. Two common alternatives are cross-entropy loss and absolute (L1) error.
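One problem with squared error can be seen numerically. In this sketch, a sigmoid output unit saturates at the confidently wrong answer (the scenario and the value z = −10 are illustrative assumptions); the squared-error gradient with respect to the pre-activation nearly vanishes, while the cross-entropy gradient does not.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, target = -10.0, 1.0                      # confident wrong prediction
y = sigmoid(z)                              # y is nearly 0, target is 1
sq_grad = 2 * (y - target) * y * (1 - y)    # d/dz of (y - t)^2: includes y*(1-y)
ce_grad = y - target                        # d/dz of cross-entropy: no such factor
print(sq_grad, ce_grad)   # ~0 vs ~-1: squared error barely learns here
```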
Define what a Deep Neural Network is
A neural network with multiple hidden layers between the input and output layers, allowing it to learn increasingly abstract representations layer by layer.
Formal definition of overfitting in practice
During learning, the error on the training examples decreases throughout, but the generalisation error (measured on held-out data) reaches a minimum and then starts growing again.
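That U-shaped validation curve is what early stopping detects. A minimal sketch, where the error curves are synthetic and the patience parameter is an illustrative assumption:

```python
# Synthetic curves: training error keeps decreasing, while validation
# error reaches a minimum and then starts growing again.
train_err = [1 / (t + 1) for t in range(20)]
valid_err = [1 / (t + 1) + 0.005 * t for t in range(20)]

def early_stop_epoch(valid_errors, patience=3):
    """Return the epoch with the lowest validation error seen before
    `patience` consecutive epochs without improvement."""
    best, best_t, waited = float("inf"), 0, 0
    for t, e in enumerate(valid_errors):
        if e < best:
            best, best_t, waited = e, t, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_t

print(early_stop_epoch(valid_err))  # epoch where generalisation was best
```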
Training data contains information about the regularities in the mapping from input to output. But it also contains noise, explain how.
The target values may themselves be unreliable (label noise), and there is sampling error: accidental regularities that arise purely because of the particular training examples that happened to be chosen.
When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularity does it fit, what is the worst case scenario?
It fits both kinds of regularity. In the worst case, a model with too much capacity fits the sampling error very well, achieving low training error but generalising poorly.
What does a model having the “right capacity” entail
It has enough capacity to fit the true regularities in the data, but not enough to also fit the spurious regularities caused by noise and sampling error.
How to prevent overfitting in a NN
Limit the capacity of the model. Standard ways to limit the capacity of a neural net: use a smaller architecture (fewer layers and hidden units), stop learning early (before it has time to fit the noise), apply weight-decay, or add noise to the weights or activities.
How to limit the size of the model by using fewer hidden units in practice
Trial and error: train models with different numbers of hidden units and keep the smallest one whose validation error is acceptable.
What is weight-decay
A penalty term added to the cost function that is proportional to the sum of the squared weights, e.g. E + (λ/2)·Σ w², which pushes the weights towards smaller values during training.
What does weight decay prevent, and what does it improve and how?
It prevents the weights from growing large enough to fit the sampling noise in the training data, keeping the network's mapping smoother; this reduces overfitting and so improves generalisation.
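In gradient descent, the (λ/2)·Σ w² penalty simply contributes an extra λ·w to each weight's gradient, so every update also shrinks the weight towards zero. A minimal sketch (function name, learning rate, and λ are illustrative assumptions):

```python
# Weight-decay (L2) update: the penalty (lam/2)*w^2 adds lam*w to the
# gradient, so each step shrinks the weight as well as following the data.
def update_with_decay(w, grad, lr=0.1, lam=0.01):
    return w - lr * (grad + lam * w)

w = 5.0
for _ in range(100):
    w = update_with_decay(w, grad=0.0)  # no data gradient: pure decay
print(w)  # the weight has decayed towards 0
```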