What is the point of activation functions
Without activation functions, a neural network would only perform linear transformations, no matter how many layers it has. Activation functions make the network non-linear, which allows it to model real-world problems such as image recognition and language understanding.
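A quick sketch of why depth alone doesn't help (one-dimensional weights and made-up values for illustration): two stacked linear layers with no activation between them collapse into a single linear layer.

```python
def linear(w, b, x):
    return w * x + b

w1, b1 = 2.0, 1.0   # first layer (made-up values)
w2, b2 = -3.0, 0.5  # second layer

def two_layers(x):
    # no non-linearity between the layers
    return linear(w2, b2, linear(w1, b1, x))

# equivalent single layer with combined parameters:
# y = w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_eq, b_eq = w2 * w1, w2 * b1 + b2

def one_layer(x):
    return w_eq * x + b_eq
```

However many linear layers you stack, the composition is always another linear map; the activation function is what breaks this collapse.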
List common non-linear activation functions
Sigmoid
Softmax
Tanh
ReLU
Leaky ReLU
Explain the sigmoid activation function
The sigmoid activation function maps inputs to values between 0 and 1, allowing outputs to be interpreted as probabilities. However, it suffers from the vanishing gradient problem: when the input is very large or very small, the gradient becomes extremely small, so weight updates shrink and learning slows.
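A minimal sketch of sigmoid and its derivative (pure Python, function names are ours): the gradient peaks at 0.25 and collapses for large |x|, which is the vanishing gradient problem in miniature.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s * (1 - s); at most 0.25, reached at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)
```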
Explain the softmax function
The softmax function takes a vector of raw scores (also called logits) and converts them into a probability distribution over multiple classes.
Give an example of softmax
The network produces raw scores for each class
Example: [2.0, 1.0, 0.1]
Softmax:
Makes all values positive
Emphasises larger scores
Normalises them so they sum to 1
Output becomes probabilities:
Example (rounded): [0.7, 0.2, 0.1]
The class with the highest probability is the prediction.
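The steps above can be sketched as follows (function name ours; the exact values for [2.0, 1.0, 0.1] come out near 0.66, 0.24, 0.10 before rounding):

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]  # makes all values positive, emphasises larger scores
    total = sum(exps)
    return [e / total for e in exps]          # normalises so they sum to 1

probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))     # highest probability is the prediction
```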
What is tanh activation function
The tanh (hyperbolic tangent) activation function maps any real input to a value between −1 and 1.
Why is tanh useful
Outputs include negative, zero, and positive values
This makes it easier for the network to represent:
- Positive relationships
- Negative relationships
Being zero-centred helps training converge faster than sigmoid
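A small illustration (pure Python): tanh is zero-centred, and like sigmoid its gradient vanishes for large inputs.

```python
import math

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2; near 1 around x = 0,
    # near 0 for large |x| (the vanishing gradient again)
    t = math.tanh(x)
    return 1.0 - t * t

# zero-centred: negative inputs map to negative outputs, 0 maps to exactly 0
```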
What problem does tanh also suffer from?
The vanishing gradient problem.
What is the ReLU activation function
ReLU (Rectified Linear Unit) outputs the input directly if it is positive, and 0 if it is negative.
Where is ReLU used
It is widely used in hidden layers because it is computationally efficient and reduces the vanishing gradient problem, although it can suffer from the dying ReLU issue.
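ReLU and its leaky variant in a few lines (names are ours); the small negative slope is what keeps leaky ReLU's gradient from dying.

```python
def relu(x):
    # pass positive inputs through, clamp negatives to 0
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # alpha gives a small slope for negative inputs; this keeps a
    # non-zero gradient there, mitigating the dying ReLU issue
    return x if x > 0 else alpha * x
```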
What is backpropagation
An algorithm that computes gradients of the loss with respect to each weight using the chain rule.
Why is backpropagation essential
Because it allows the network to learn by identifying how each parameter affects the error.
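A hand-worked chain-rule sketch for a single sigmoid neuron with squared-error loss (all names and values are illustrative), checked against a finite-difference estimate:

```python
import math

def loss(w, x, y):
    a = 1.0 / (1.0 + math.exp(-w * x))  # forward: sigmoid(w * x)
    return 0.5 * (a - y) ** 2           # squared-error loss

def grad_w(w, x, y):
    a = 1.0 / (1.0 + math.exp(-w * x))
    dL_da = a - y        # how the loss changes with the activation
    da_dz = a * (1 - a)  # sigmoid derivative
    dz_dw = x            # z = w * x
    return dL_da * da_dz * dz_dw  # chain rule: multiply the local derivatives

# finite-difference check that the analytic gradient is right
w, x, y, eps = 0.7, 1.3, 1.0, 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
```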
What is the learning rate
The learning rate controls the step size of weight updates during optimisation.
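A toy example on the made-up function f(w) = (w − 3)²: the learning rate scales each downhill step.

```python
# minimise f(w) = (w - 3)^2 by gradient descent
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)  # df/dw
    w = w - lr * grad   # update = learning rate * gradient
# w ends up close to the minimum at w = 3
```

A much larger learning rate here (e.g. lr > 1) would overshoot and diverge; a much smaller one would converge slowly.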
What is the purpose of a loss function
A loss function measures how badly the model did by comparing its predicted output to the ground truth; the resulting penalty is what training tries to minimise.
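As a concrete instance, a sketch of mean squared error, a common loss for regression:

```python
def mse(predictions, targets):
    # average squared difference between prediction and ground truth;
    # the further off the model is, the larger the penalty
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```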
Explain why cross-entropy is preferred over squared error for classification.
Because cross-entropy heavily penalises confident incorrect predictions and aligns with probabilistic outputs.
Why is binary cross-entropy paired with sigmoid activation?
Because sigmoid outputs probabilities in the range [0, 1], matching the single class probability that binary cross-entropy expects.
What does minimising cross-entropy result in
Maximising likelihood
The model becomes more confident when it is correct
And is heavily penalised when it is confidently wrong
Cross-entropy penalises the model if it gives high probability to a wrong class — especially if it’s very confident and wrong.
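Binary cross-entropy in code (a sketch) makes the asymmetry visible: a confident wrong prediction costs far more than a confident right one.

```python
import math

def bce(y, p):
    # y is the true label (0 or 1), p is the predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_right = bce(1, 0.99)  # -ln(0.99), a tiny loss
confident_wrong = bce(1, 0.01)  # -ln(0.01), a huge loss
```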
In deep learning, an activation function should be
Differentiable, so its derivative/gradient can be computed during backpropagation
What activation function is typically paired with binary cross-entropy?
Sigmoid.
What happens in binary cross-entropy when y = 1?
The loss penalises a low predicted probability ŷ.
What happens in binary cross-entropy when y = 0?
The loss penalises a high predicted probability ŷ.
What activation function is paired with multi-class cross-entropy?
Softmax
How do you find the best weights that yield the smallest loss
In principle, solving gradient(loss)(W) = 0 would give the best weights directly, but for neural networks this has no closed-form solution, so in practice we search iteratively with gradient descent.
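A sketch of that iterative search: fitting y = w·x + b to toy data by gradient descent (data and hyperparameters made up for illustration).

```python
# toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    # gradients of mean squared error with respect to w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * dw  # step opposite the gradient...
    b -= lr * db  # ...instead of solving gradient(loss) = 0 in closed form
```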