What is the ADAM algorithm?
Adam is a variation of gradient descent. It estimates the moments of the gradient to adapt the step size in each dimension separately.
The moments measure how much the gradient changes. If the gradient has a large variance (changes a lot), the function changes a lot -> small step size.
Gradient changes only a little -> large step size, since the function is locally smooth.
So depending on the change of the gradient, we choose either a small or a large step size.
A heuristic, but a very popular one: not rigorously derived mathematically, but it works well in practice.
1st moment -> mean
2nd moment -> variance.
Three hyperparameters: the learning rate and two decay rates (β1, β2); the decay rates control how far into the past we look when estimating the gradient's moments.
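The moment estimates and their role in scaling the step can be sketched in NumPy. This is a minimal toy run on f(x) = x²; the decay rates β1 = 0.9 and β2 = 0.999 are the usual Adam defaults, while the learning rate, the helper name `adam_step`, and the toy problem are illustrative choices:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: track the 1st and 2nd moment of the gradient,
    bias-correct them, and scale the step per dimension."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment -> mean estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment -> variance estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy run: minimise f(x) = x^2, whose gradient is 2x
theta = np.array([5.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)   # near the minimum at 0
```

Note how the step in each dimension is divided by the square root of the variance estimate: a large variance shrinks the step, a small one enlarges it, exactly as described above.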
Explain the comparison between the gradient descent algorithm and its variation, the Adam algorithm, when applied to different functions.
Adam is much slower at the start than steepest gradient descent. However, after a certain point it reaches the minimum rapidly.
Explain the use of hidden layers in a Multi-Layer Perceptron and explain whether the number of hidden layers affects the system.
The term Multi-Layer Perceptron is used interchangeably with Feed-forward Network.
The architecture of an MLP can be represented as follows:
Input Layer -> Hidden Layer 1 -> Hidden Layer 2 -> … -> Output Layer
The hidden layers in an MLP play a crucial role in feature extraction and learning complex representations of the input data. Each neuron in the hidden layers receives inputs from the previous layer, applies certain weights to those inputs, and passes the weighted sum through an activation function, introducing non-linearity to the model. The non-linearity enables the network to learn and approximate complex functions, making it more powerful than a linear model.
The hidden layer neurons allow for the function of a neural network to be broken down into specific transformations of the data.
As a rule of thumb, around 10 layers is often sufficient, but the appropriate number of layers depends on the problem and on the dimensionality of the data.
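The architecture above can be sketched as a minimal forward pass: each hidden layer applies a weighted sum plus bias followed by a non-linearity, and the output layer stays linear. The layer sizes, the tanh activation, and the helper name `mlp_forward` are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass through an MLP: every hidden layer applies a weighted
    sum plus bias, then a non-linear activation; the output layer is linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)        # hidden layer: affine map + non-linearity
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]                  # input -> hidden 1 -> hidden 2 -> output
weights = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
y = mlp_forward(rng.normal(size=4), weights, biases)
print(y.shape)   # a single output value
```

Without the non-linearity, the stacked layers would collapse into one linear map, which is why the activation function is what makes depth useful.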
Why is tanh used in ML? State the advantages.
tanh is smooth, differentiable everywhere, and bounded in (-1, 1); its outputs are zero-centered, which tends to make optimization easier than with the (non-zero-centered) sigmoid.
State why using tanh can be a problem in ML.
tanh saturates for inputs of large magnitude: its derivative approaches zero there, so gradients flowing through saturated units become very small.
Explain the vanishing gradient problem.
The vanishing gradient problem occurs when the gradient values become extremely small, typically close to zero, as they are propagated from the deeper layers (closer to the output) to the shallower layers (closer to the input) of the network. Consequently, the weights in the shallower layers receive negligible updates, slowing down the learning process for those layers. As a result, the network may fail to learn meaningful representations in the early layers, limiting the overall performance of the model.
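The effect can be seen numerically with tanh activations: the chain rule contributes one factor tanh′(z) = 1 − tanh(z)² ≤ 1 per layer, so the product shrinks with depth. The random pre-activations and the depth of 50 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
z = rng.normal(size=depth)            # one pre-activation per layer (illustrative)
grad = 1.0
for zi in z:
    grad *= 1 - np.tanh(zi) ** 2      # tanh'(z): one chain-rule factor per layer
print(grad)                           # shrinks towards zero as depth grows
```

Each factor is at most 1 and typically well below it, so the gradient reaching the shallow layers is the product of many sub-unit factors and collapses towards zero.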
What is the difference between the loss function and the cost function?
The terms are used interchangeably, but in some contexts they refer to slightly different concepts: the loss is computed for a single training example, while the cost is the average (or sum) of the loss over the entire training set.
What is an epoch?
An epoch is completed when the model has seen and processed every training sample once.
In Stochastic Gradient Descent, we do not care about arriving exactly at a global or local minimum.
What does this mean for the topic?
Why is the batch size important for Stochastic Gradient Descent?
A small batch size leads to a somewhat noisy approximation of the gradient; in some cases this noise may help us escape from a local minimum.
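A quick numerical sketch of this point: gradient estimates computed from small mini-batches have a much larger spread than those from large ones. The noisy linear model, the batch sizes, and the helper name `batch_grad` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=0.5, size=n)   # noisy linear data (illustrative)
w = 0.0                                        # current parameter estimate

def batch_grad(idx):
    """Mini-batch estimate of the mean-squared-error gradient at w."""
    return np.mean(2 * (x[idx] * w - y[idx]) * x[idx])

stds = {}
for size in (8, 256):
    grads = [batch_grad(rng.choice(n, size)) for _ in range(200)]
    stds[size] = np.std(grads)                 # spread of the gradient estimates
print(stds)   # the size-8 estimates spread far more than the size-256 ones
```

The spread scales roughly with one over the square root of the batch size; that extra randomness is exactly what can kick the iterate out of a shallow local minimum.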
Explain the concept of backpropagation
The training of a neural network is done using backpropagation.
The backpropagation algorithm allows information from the cost to flow backwards through the network in order to compute the gradient.
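A minimal sketch of backpropagation on a tiny two-parameter network y = w2 · tanh(w1 · x), checked against a numerical gradient; the network shape and the helper name `forward_backward` are illustrative:

```python
import numpy as np

def forward_backward(x, target, w1, w2):
    """Forward pass stores intermediates; backward pass sends the loss
    gradient back through the network via the chain rule."""
    h = np.tanh(w1 * x)              # forward: hidden activation
    y = w2 * h                       # forward: output
    loss = 0.5 * (y - target) ** 2
    dy = y - target                  # backward: dL/dy
    dw2 = dy * h                     # dL/dw2
    dh = dy * w2                     # gradient flows back to the hidden unit
    dw1 = dh * (1 - h ** 2) * x      # through tanh' and the input
    return loss, dw1, dw2

# Compare the backpropagated gradient with a central finite difference
x, t, w1, w2, eps = 0.7, 1.3, 0.5, -0.4, 1e-6
loss, dw1, dw2 = forward_backward(x, t, w1, w2)
num_dw1 = (forward_backward(x, t, w1 + eps, w2)[0]
           - forward_backward(x, t, w1 - eps, w2)[0]) / (2 * eps)
print(dw1, num_dw1)   # the two gradient values agree
```

The backward pass reuses the intermediates h and y from the forward pass, which is what makes backpropagation cheap compared to differentiating each weight independently.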
What batch size do we want in image processing and image classification?
A small one, mostly for complexity reasons: single examples are high-dimensional, networks are large, and the number of available examples is small.
What type of batch size do we want in communications?
We want a bigger mini-batch in communications.
How do we choose a batch size in communications?
Solution A: Use large mini-batch sizes, which we generate on the fly anyhow
Solution B: Increase mini-batch size during training
Give the definition of a Convolutional Network
ConvNets are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
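This definition can be checked directly: a discrete 1-D convolution equals multiplication by a banded matrix whose rows all reuse the same kernel weights. The helper name `conv_matrix` and the toy signal are illustrative:

```python
import numpy as np

def conv_matrix(kernel, n):
    """Build the matrix whose product with a length-n signal equals
    'valid' convolution with the (flipped) kernel."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = kernel[::-1]   # same weights, shifted along each row
    return M

x = np.arange(6.0)
kernel = np.array([1.0, -1.0])
direct = np.convolve(x, kernel, mode='valid')    # convolution layer view
via_matrix = conv_matrix(kernel, len(x)) @ x     # matrix multiplication view
print(direct, via_matrix)   # identical results
```

The matrix is sparse and its nonzero entries are shared across rows, which is where the efficiency of convolutional layers comes from.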
What are the complexity advantages of CNNs?
Sparse connectivity and parameter sharing: each output unit interacts with only a few inputs, and the same kernel weights are reused at every position, which greatly reduces the number of parameters to store and operations to compute.
What are the equivariant representations of CNNs?
Convolution is equivariant to translation: if the input is shifted, the output is shifted by the same amount, so the same feature is detected wherever it appears in the input.
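Translation equivariance can be demonstrated with a 1-D convolution: shifting the input impulse by two positions shifts the convolution output by the same two positions. The impulse signal and the kernel are illustrative choices:

```python
import numpy as np

x = np.zeros(10)
x[3] = 1.0                                   # impulse at position 3
kernel = np.array([1.0, 2.0, 1.0])
y = np.convolve(x, kernel, mode='full')      # kernel response appears at 3

x_shifted = np.zeros(10)
x_shifted[5] = 1.0                           # same impulse, shifted by 2
y_shifted = np.convolve(x_shifted, kernel, mode='full')

print(np.allclose(y_shifted[2:], y[:-2]))    # output shifted by the same 2
```

Shifting then convolving gives the same result as convolving then shifting, which is the formal statement of equivariance.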