FCNs Flashcards

(34 cards)

1
Q

What is the point of activation functions?

A

Without activation functions, a neural network would only perform linear transformations, no matter how many layers it has. Activation functions make the network non-linear, which allows it to model real-world problems such as image recognition and language understanding.

2
Q

List the non-linear activation functions

A

Sigmoid
Softmax
Tanh
ReLU
Leaky ReLU

3
Q

Explain the sigmoid activation function

A

The sigmoid activation function maps inputs to values between 0 and 1, allowing outputs to be interpreted as probabilities. However, it suffers from the vanishing gradient problem: when the input is very large or very small, the gradient becomes extremely small, so weight updates shrink and learning slows.
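
A minimal Python sketch of both properties (the helper names are illustrative, not from the cards):

```python
import math

def sigmoid(x):
    """Map any real input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0))        # 0.5
print(sigmoid_grad(0))   # 0.25, the largest possible gradient
print(sigmoid_grad(10))  # ~4.5e-05: the gradient vanishes for large inputs
```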

4
Q

Explain the softmax function

A

The softmax function takes a vector of raw scores (also called logits) and converts them into a probability distribution over multiple classes.

5
Q

Give an example of softmax

A

The network produces raw scores for each class
Example: [2.0, 1.0, 0.1]
Softmax:
- Makes all values positive
- Emphasises larger scores
- Normalises them so they sum to 1
Output becomes probabilities:
Example: ≈ [0.66, 0.24, 0.10]
The class with the highest probability is the prediction.
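
The same computation as a minimal Python sketch:

```python
import math

def softmax(scores):
    """Exponentiate (all values become positive), then normalise to sum to 1."""
    m = max(scores)  # subtracting the max is a standard numerical-stability trick
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
```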

6
Q

What is the tanh activation function?

A

The tanh (hyperbolic tangent) activation function maps any real input to a value between −1 and 1.

7
Q

Why is tanh useful?

A

Outputs include negative, zero, and positive values.

This makes it easier for the network to represent:
- Positive relationships
- Negative relationships

Being zero-centred helps training converge faster than sigmoid.
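
For example, in Python:

```python
import math

# tanh is zero-centred: negative inputs map to negative outputs,
# unlike sigmoid, whose outputs are always positive.
print(math.tanh(-2.0))  # ≈ -0.96
print(math.tanh(0.0))   # 0.0
print(math.tanh(2.0))   # ≈ 0.96
```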

8
Q

What problem does tanh also suffer from?

A

The vanishing gradient problem

9
Q

What is the ReLU activation function?

A

ReLU (Rectified Linear Unit) outputs the input directly if it is positive, and 0 if it is negative.

10
Q

Where is ReLU used?

A

It is widely used in hidden layers because it is computationally efficient and reduces the vanishing gradient problem, although it can suffer from the dying ReLU issue.
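
A minimal sketch of ReLU and its leaky variant (the helper names are illustrative):

```python
def relu(x):
    """Pass positive inputs through unchanged; zero out negatives."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """A small slope for negative inputs keeps a gradient alive,
    avoiding the dying-ReLU problem."""
    return x if x > 0 else alpha * x

print(relu(3.5))         # 3.5
print(relu(-2.0))        # 0.0 (the gradient here is also 0: the dying-ReLU issue)
print(leaky_relu(-2.0))  # -0.02
```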

11
Q

What is backpropagation?

A

An algorithm that computes gradients of the loss with respect to each weight using the chain rule.
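
A minimal sketch of the chain rule for a single-neuron "network" (the squared-error loss and sigmoid here are assumptions for the example), checked against a numerical gradient:

```python
import math

# y_pred = sigmoid(w * x), loss = (y_pred - y)**2.
# Chain rule: dL/dw = dL/dy_pred * dy_pred/dz * dz/dw, where z = w * x.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(w, x, y):
    y_pred = sigmoid(w * x)
    dL_dy = 2.0 * (y_pred - y)        # derivative of the squared error
    dy_dz = y_pred * (1.0 - y_pred)   # derivative of sigmoid
    dz_dw = x                         # derivative of z = w * x
    return dL_dy * dy_dz * dz_dw

# Sanity check against a finite-difference approximation.
w, x, y, eps = 0.5, 1.5, 1.0, 1e-6
analytic = loss_gradient(w, x, y)
numeric = ((sigmoid((w + eps) * x) - y) ** 2 -
           (sigmoid((w - eps) * x) - y) ** 2) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True
```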

12
Q

Why is backpropagation essential?

A

Because it allows the network to learn by identifying how each parameter affects the error.

13
Q

What is the learning rate?

A

The learning rate controls the step size of weight updates during optimisation.

14
Q

What is the purpose of a loss function

A

The loss function measures how badly the model performed by comparing its predicted output to the ground truth, penalising the model in proportion to the mismatch.

15
Q

Explain why cross-entropy is preferred over squared error for classification.

A

Because cross-entropy heavily penalises confident incorrect predictions and aligns with probabilistic outputs.

16
Q

Why is binary cross-entropy paired with sigmoid activation?

A

Because sigmoid outputs probabilities in the range [0, 1]

17
Q

What does minimising cross-entropy result in?

A

Maximising likelihood

The model becomes more confident when it is correct
And is heavily penalised when it is confidently wrong

Cross-entropy penalises the model for assigning high probability to a wrong class, especially when it is very confident and wrong.
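
A minimal sketch of the binary case:

```python
import math

def binary_cross_entropy(y, p):
    """y: true label (0 or 1); p: predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(1, 0.9))   # ≈ 0.105: confident and correct, small loss
print(binary_cross_entropy(1, 0.01))  # ≈ 4.605: confident and wrong, large loss
```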

18
Q

In deep learning, an activation function should be

A
  1. Differentiable
  2. A function of its input (the neuron’s pre-activated value)
  3. A smooth, continuous function
19
Q

Derivative/gradient

A
  • The gradient is the generalisation of the derivative to multiple dimensions
  • Gradient indicates the direction and rate of the steepest ascent of the error function
  • Since we want to minimise the error function, we move in the opposite direction
20
Q

What activation function is typically paired with binary cross-entropy?

A

Sigmoid

21
Q

What happens in binary cross-entropy when y = 1?

A

The loss penalises low predicted probability y*.

22
Q

What happens in binary cross-entropy when y = 0?

A

The loss penalises high predicted probability y*.

23
Q

What activation function is paired with multi-class cross-entropy?

A

Softmax

24
Q

How do you find the best weights that yield the smallest loss?

A

In principle, we could solve gradient(loss)(W) = 0 and directly find the best weights.

25
Q

Why can't we find the best weights analytically?

A

Why don’t we actually solve it?
- Neural networks are too complex
- Loss functions are non-linear
- There are millions of weights

So instead, we approximate the solution using gradient descent.

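A minimal sketch of that approximation on a toy one-dimensional loss (the function and values are illustrative):

```python
def gradient_descent(grad, w=5.0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to approximate the minimiser."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Toy loss (w - 3)**2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_best = gradient_descent(lambda w: 2 * (w - 3))
print(round(w_best, 4))  # 3.0
```
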
26
Q

Explain stochastic gradient descent

A

- Select a random training sample xi and corresponding target yi.
- Run the network on xi to obtain prediction y_predi.
- Compute the loss of the network on the sample xi, a measure of the mismatch between y_predi and yi.
- Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
- Move the parameters a little in the opposite direction from the gradient—for example W -= step * gradient—thus reducing the loss on the sample a bit.

27
Q

Explain mini-batch stochastic gradient descent

A

- Draw a batch of training samples x and corresponding targets y.
- Run the network on x to obtain predictions y_pred.
- Compute the loss of the network on the batch, a measure of the mismatch between y_pred and y.
- Compute the gradient of the loss with regard to the network’s parameters (a backward pass).
- Move the parameters a little in the opposite direction from the gradient—for example W -= step * gradient—thus reducing the loss on the batch a bit.

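The loop above, sketched for a one-parameter linear model on toy data (all names and numbers are illustrative):

```python
import random

# Toy data generated by y = 2 * x, so the best weight is 2.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

random.seed(0)
w, step, batch_size = 0.0, 0.05, 2
for _ in range(200):
    batch = random.sample(data, batch_size)  # draw a batch of samples
    # Gradient of the batch's mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= step * gradient  # move opposite the gradient
print(round(w, 3))  # 2.0
```
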
28
Q

How can dropout overcome overfitting?

A

During training, dropout randomly “drops” (turns off) a fraction of neurons in each layer. This forces the network to learn redundant and more robust features, rather than memorising the training data. As a result, the model generalises better to unseen data because it focuses on more general features.

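A minimal sketch of (inverted) dropout applied to one layer's activations (the function name is illustrative):

```python
import random

def dropout(activations, rate=0.5, training=True):
    """Zero each unit with probability `rate` during training, scaling
    the survivors by 1 / (1 - rate) so the expected value is unchanged.
    At inference time the layer passes values through untouched."""
    if not training or rate == 0:
        return activations
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(42)
print(dropout([1.0, 2.0, 3.0, 4.0]))  # survivors doubled, the rest zeroed
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))  # [1.0, 2.0, 3.0, 4.0]
```
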
29
Q

What hyperparameters can be tuned?

A

For a given neural network, several parameters can be optimised, including the number of hidden neurons, the batch size (BATCH_SIZE), and the number of epochs.

30
Q

What is the point of hyperparameter tuning?

A

Hyperparameter tuning is the process of finding the combination of those parameters that minimises the loss function.

31
Q

What does the number of epochs tell us?

A

It is the number of times all the training samples have passed through the neural network during training.

32
Q

What does the batch size tell us?

A

It is used to partition the training data into mini-batches to pass them through the network. In Keras, batch_size is the argument of the fit() method that indicates the size of the batches used in one iteration of training to update the gradient.

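How epochs and batch size combine (the MNIST-style numbers are illustrative):

```python
import math

# One epoch = every training sample seen once;
# each batch processed = one gradient update.
n_samples, batch_size, epochs = 60000, 128, 5

updates_per_epoch = math.ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 469 2345
```
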
33
Q

What does the learning rate do?

A

Gradient descent algorithms multiply the magnitude of the gradient by a scalar known as the learning rate (also called the step size) to determine the next point.

34
Q

What is a commonly used optimiser?

A

Adam (adaptive moment estimation), which adapts the step size for each parameter during training.
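
A minimal sketch of a single-parameter Adam update (the defaults b1=0.9, b2=0.999 are the standard ones; the toy loss is illustrative):

```python
import math

def adam_step(grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected moving averages of the gradient (m)
    and its square (v) yield an adaptive step size."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimise the toy loss (w - 3)**2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    update, m, v = adam_step(2 * (w - 3), m, v, t)
    w -= update
print(round(w, 2))  # close to 3.0
```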