What is the point of activation functions
Without activation functions, a neural network would only perform linear transformations, no matter how many layers it has. Activation functions make the network non-linear, which allows it to model real-world problems such as image recognition and language understanding.
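A quick sketch of why depth alone doesn't help (one-dimensional weights and made-up values for illustration): two stacked linear layers with no activation between them collapse into a single linear layer.

```python
def linear(w, b, x):
    return w * x + b

w1, b1 = 2.0, 1.0   # first layer (made-up values)
w2, b2 = -3.0, 0.5  # second layer

def two_layers(x):
    # no non-linearity between the layers
    return linear(w2, b2, linear(w1, b1, x))

# equivalent single layer with combined parameters:
# y = w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_eq, b_eq = w2 * w1, w2 * b1 + b2

def one_layer(x):
    return w_eq * x + b_eq
```

However many linear layers you stack, the composition is always another linear map; the activation function is what breaks this collapse.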
List common non-linear activation functions
Sigmoid
Softmax
Tanh
ReLU
Leaky ReLU
Explain the sigmoid activation function
The sigmoid activation function maps inputs to values between 0 and 1, allowing outputs to be interpreted as probabilities. However, it suffers from the vanishing gradient problem: when the input is very large or very small, the gradient becomes extremely small, so weight updates shrink and learning slows.
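A minimal sketch of sigmoid and its derivative (pure Python, function names are ours): the gradient peaks at 0.25 and collapses for large |x|, which is the vanishing gradient problem in miniature.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s * (1 - s); at most 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)
```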
Explain the softmax function
The softmax function takes a vector of raw scores (also called logits) and converts them into a probability distribution over multiple classes.
Give an example of softmax
The network produces raw scores for each class
Example: [2.0, 1.0, 0.1]
Softmax:
Makes all values positive
Emphasises larger scores
Normalises them so they sum to 1
Output becomes probabilities:
Example (rounded): [0.7, 0.2, 0.1]
The class with the highest probability is the prediction.
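The steps above can be sketched as follows (function name ours; the exact values for [2.0, 1.0, 0.1] come out near 0.66, 0.24, 0.10 before rounding):

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]  # makes all values positive, emphasises larger scores
    total = sum(exps)
    return [e / total for e in exps]          # normalises so they sum to 1

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))     # highest probability is the prediction
```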
What is tanh activation function
The tanh (hyperbolic tangent) activation function maps any real input to a value between −1 and 1.
Why is tanh useful
Outputs include negative, zero, and positive values
This makes it easier for the network to represent:
- Positive relationships
- Negative relationships
Being zero-centred helps training converge faster than sigmoid
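A small illustration (pure Python): tanh is zero-centred, and like sigmoid its gradient vanishes for large inputs.

```python
import math

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2; near 1 around x = 0,
    # near 0 for large |x| (the vanishing gradient again)
    t = math.tanh(x)
    return 1.0 - t * t

# zero-centred: negative inputs map to negative outputs, 0 maps to exactly 0
```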
What problem does tanh also suffer from?
The vanishing gradient problem.
What is the ReLU activation function
ReLU (Rectified Linear Unit) outputs the input directly if it is positive, and 0 if it is negative.
Where is ReLU used
It is widely used in hidden layers because it is computationally efficient and reduces the vanishing gradient problem, although it can suffer from the dying ReLU issue.
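ReLU and its leaky variant in a few lines (names are ours); the small negative slope is what keeps leaky ReLU's gradient from dying.

```python
def relu(x):
    # pass positive inputs through, clamp negatives to 0
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # alpha gives a small slope for negative inputs; this keeps a
    # non-zero gradient there, mitigating the dying ReLU issue
    return x if x > 0 else alpha * x
```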
What is backpropagation
An algorithm that computes gradients of the loss with respect to each weight using the chain rule.
Why is backpropagation essential
Because it allows the network to learn by identifying how each parameter affects the error.
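A hand-worked chain-rule sketch for a single sigmoid neuron with squared-error loss (all names and values are illustrative), checked against a finite-difference estimate:

```python
import math

def loss(w, x, y):
    a = 1.0 / (1.0 + math.exp(-w * x))  # forward: sigmoid(w * x)
    return 0.5 * (a - y) ** 2           # squared-error loss

def grad_w(w, x, y):
    a = 1.0 / (1.0 + math.exp(-w * x))
    dL_da = a - y        # how the loss changes with the activation
    da_dz = a * (1 - a)  # sigmoid derivative
    dz_dw = x            # z = w * x
    return dL_da * da_dz * dz_dw  # chain rule: multiply the local derivatives

# finite-difference check that the analytic gradient is right
w, x, y, eps = 0.7, 1.3, 1.0, 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
```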
What is the learning rate
The learning rate controls the step size of weight updates during optimisation.
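A toy example on the made-up function f(w) = (w − 3)²: the learning rate scales each downhill step.

```python
# minimise f(w) = (w - 3)^2 by gradient descent
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)  # df/dw
    w = w - lr * grad   # update = learning rate * gradient
# w ends up close to the minimum at w = 3
```

A much larger learning rate here (e.g. lr > 1) would overshoot and diverge; a much smaller one would converge slowly.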
What is the purpose of a loss function
A loss function measures how badly the model did by comparing its predicted output to the ground truth; the resulting penalty is what training tries to minimise.
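As a concrete instance, a sketch of mean squared error, a common loss for regression:

```python
def mse(predictions, targets):
    # average squared difference between prediction and ground truth;
    # the further off the model is, the larger the penalty
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```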
Explain why cross-entropy is preferred over squared error for classification.
Because cross-entropy heavily penalises confident incorrect predictions and aligns with probabilistic outputs.
Why is binary cross-entropy paired with sigmoid activation?
Because sigmoid outputs probabilities in the range [0, 1], matching the single class probability that binary cross-entropy expects.
What does minimising cross-entropy result in
Maximising likelihood
The model becomes more confident when it is correct
And is heavily penalised when it is confidently wrong
Cross-entropy penalises the model if it gives high probability to a wrong class — especially if it’s very confident and wrong.
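Binary cross-entropy in code (a sketch) makes the asymmetry visible: a confident wrong prediction costs far more than a confident right one.

```python
import math

def bce(y, p):
    # y is the true label (0 or 1), p is the predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_right = bce(1, 0.99)  # -ln(0.99), a tiny loss
confident_wrong = bce(1, 0.01)  # -ln(0.01), a huge loss
```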
In deep learning, an activation function should be
Differentiable, so its derivative/gradient can be computed during backpropagation
What activation function is typically paired with binary cross-entropy?
Sigmoid.
What happens in binary cross-entropy when y = 1?
The loss penalises a low predicted probability ŷ.
What happens in binary cross-entropy when y = 0?
The loss penalises a high predicted probability ŷ.
What activation function is paired with multi-class cross-entropy?
Softmax
How do you find the best weights that yield the smallest loss
In principle, solving gradient(loss)(W) = 0 would give the best weights directly, but for neural networks this has no closed-form solution, so in practice we search iteratively with gradient descent.
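A sketch of that iterative search: fitting y = w·x + b to toy data by gradient descent (data and hyperparameters made up for illustration).

```python
# toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    # gradients of mean squared error with respect to w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * dw  # step opposite the gradient...
    b -= lr * db  # ...instead of solving gradient(loss) = 0 in closed form
```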