Fill in the gaps
The convention used in Linear Algebra and re-used in this course is to use upper case variable names for ………. and lower case variable names for ……….
The convention used in Linear Algebra and re-used in this course is to use upper case variable names for MATRICES and lower case variable names for VECTORS or SCALARS.
Which of the following activation functions is the most common choice for the hidden layers of a neural network?
ReLU
Rectified Linear Unit
What do people mean when they say the hidden layers of a neural network do not employ an activation function?
This means the layer passes its linear combination z = w · a + b through unchanged, i.e. it uses the identity (linear) activation g(z) = z.
At a high-level, what are the three key steps to train an artificial neural network?
1. Specify the model f_wb(X), i.e. how the output is computed from the input X and parameters w, b. 2. Specify the loss L(f_wb(X), y) and the cost J(w, b), the average loss over the training set. 3. Minimise the cost J(w, b), for example with gradient descent.
W.r.t. neural networks, what is an epoch?
It refers to one complete pass of the entire training dataset through the learning algorithm. In other words, when all the data samples have been exposed to the neural network for learning patterns, one epoch is said to be completed.
Describe the three key steps involved in training a neural network in TensorFlow.
The first step involves specifying the model architecture, defining the layers and their connections for inference. Second, the model is compiled by choosing a suitable loss function, such as binary cross-entropy. Finally, the fit function is called to train the model on the provided dataset for a specified number of epochs, optimising the parameters to minimise the chosen loss.
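The three steps can be sketched in Keras as follows; the layer sizes and the synthetic data are illustrative assumptions, not the course's exact model.

```python
# A minimal sketch of the three TensorFlow training steps for a
# binary classifier; layer sizes and data are illustrative only.
import numpy as np
import tensorflow as tf

# Step 1: specify the model architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Step 2: compile with a loss suited to binary targets.
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer="adam")

# Step 3: call fit on (X, y) for a number of epochs.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2.0).astype("float32")
model.fit(X, y, epochs=5, verbose=0)
```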
What is the primary architectural difference between a dense layer and a convolutional layer in a neural network?
In a dense layer, every neuron receives input from all activations in the preceding layer. In contrast, a convolutional layer’s neurons receive input only from a limited, specific region (or “window”) of the input from the previous layer, which helps in tasks like image processing or time-series analysis by focusing on local features.
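The connectivity difference can be illustrated on a 1-D input; the window size of 3 and the weight values below are arbitrary choices for the sketch.

```python
# Sketch contrasting dense vs convolutional connectivity on a 1-D input.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Dense neuron: one weight per input value -- it sees the whole input.
w_dense = np.ones_like(x)
dense_out = w_dense @ x  # uses all 5 activations

# Convolutional neuron: a small kernel slid over local windows,
# so each output depends on only 3 neighbouring inputs.
kernel = np.array([0.5, 1.0, 0.5])
conv_out = np.array([kernel @ x[i:i + 3] for i in range(len(x) - 2)])

print(dense_out)  # 15.0
print(conv_out)   # [4. 6. 8.]
```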
Explain the main advantages of using a ReLU activation function in hidden layers compared to a sigmoid activation function.
ReLU is computationally faster than sigmoid as it only involves max(0, z). More importantly, ReLU helps prevent the “vanishing gradient” problem because it only goes flat on one side (for negative values), whereas sigmoid goes flat on both extremes, slowing down gradient descent in those regions.
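The saturation behaviour is easy to verify numerically; the input value 10.0 below is just an example of a point far out on the positive side.

```python
# Sketch of why ReLU avoids saturation on the positive side while
# sigmoid flattens at both extremes.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of sigmoid

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # derivative of max(0, z)

print(sigmoid_grad(10.0))  # ~4.5e-05: the gradient has vanished
print(relu_grad(10.0))     # 1.0: still a useful gradient for positive z
```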
How does the Adam optimisation algorithm differ from basic gradient descent in its approach to learning rates?
Adam (Adaptive Moment Estimation) automatically adjusts the learning rate during training, unlike basic gradient descent which uses a single, fixed global learning rate. Furthermore, Adam uses a different, adaptive learning rate for each individual parameter of the model, accelerating learning by adjusting speeds based on the parameter’s movement (e.g., increasing if moving consistently in one direction, decreasing if oscillating).
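In Keras, Adam is selected at compile time; the 1e-3 value below is only the common default initial learning rate, which Adam then adapts per parameter during training.

```python
# Minimal sketch of choosing Adam with an explicit initial learning rate.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(),
)
```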
Why is it crucial to use non-linear activation functions (i.e., not just linear activation) in the hidden layers of a neural network?
If only linear activation functions are used in all hidden layers, the entire neural network, regardless of its depth, would effectively reduce to a simple linear function (like linear regression). Non-linear activation functions allow the network to learn and represent complex, non-linear relationships in the data, which is essential for solving most real-world problems that linear models cannot address.
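This collapse can be demonstrated directly: two stacked linear layers compute exactly the same function as one. The weight and bias values are arbitrary illustrations.

```python
# Sketch showing that stacking purely linear layers collapses to a
# single linear map.
import numpy as np

W1, b1 = np.array([[2.0, 0.0], [0.0, 3.0]]), np.array([1.0, -1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

def two_linear_layers(x):
    return W2 @ (W1 @ x + b1) + b2

# The same composite map written as ONE linear layer.
W = W2 @ W1
b = W2 @ b1 + b2

x = np.array([1.5, -2.0])
print(two_linear_layers(x))  # identical outputs
print(W @ x + b)
```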
What is the binary cross-entropy loss function, and for what type of problem is it typically used?
The binary cross-entropy loss function measures the performance of a classification model whose output is a probability value between 0 and 1. It’s typically used for binary classification problems, where the target label Y can only take on two values (e.g., 0 or 1), by quantifying the difference between the predicted probabilities and the true labels.
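A hand computation shows how the loss penalises confident mistakes; the probabilities and labels below are made-up examples.

```python
# Hand-computed binary cross-entropy for illustrative predictions.
import math

def bce(y_true, y_pred):
    # -[y*log(p) + (1-y)*log(1-p)], averaged over the samples
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Confident, correct predictions give a low loss...
print(bce([1, 0], [0.9, 0.1]))
# ...while confident, wrong ones are heavily penalised.
print(bce([1, 0], [0.1, 0.9]))
```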
Briefly explain the informal definition of a derivative as presented in the context of neural network learning.
Informally, a derivative K indicates how much a function's output J(W) changes when its input W changes by a tiny amount epsilon. If W goes up by epsilon and J(W) goes up by approximately K * epsilon, then K is the derivative of J(W) with respect to W. This concept is crucial for understanding how small adjustments to parameters affect the cost function.
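The informal definition can be checked numerically with a finite difference; J(W) = W**2 is an example choice of cost function, not one from the course.

```python
# Numerical check of the informal definition: if W moves by epsilon,
# J(W) moves by roughly K * epsilon.
def J(w):
    return w ** 2  # toy cost function

w, epsilon = 3.0, 1e-6
k = (J(w + epsilon) - J(w)) / epsilon  # finite-difference estimate of K
print(k)  # close to the analytic derivative 2*w = 6
```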
Distinguish between multiclass classification and multilabel classification, providing an example for each.
Multiclass classification problems involve predicting one label out of more than two possible discrete categories (e.g., classifying handwritten digits from 0-9, where an image is only one digit). Multilabel classification problems, on the other hand, involve predicting multiple labels for a single input simultaneously (e.g., identifying if a picture contains a car, a bus, and/or a pedestrian).
Why is it recommended to use the from_logits=True argument when compiling a model with BinaryCrossentropy or CategoricalCrossentropy loss in TensorFlow?
Using from_logits=True allows TensorFlow to combine the activation function (like sigmoid or softmax) and the loss calculation into a single, numerically more stable operation. This helps to reduce numerical round-off errors, especially when dealing with very small or very large intermediate values, leading to more accurate and reliable training.
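The recommended pattern pairs a linear output layer with from_logits=True, applying sigmoid only at prediction time; the layer sizes below are illustrative.

```python
# Sketch of the numerically stable pattern: a linear output layer
# plus from_logits=True, with sigmoid applied only at inference.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),  # outputs raw logits
])
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer="adam",
)

# At inference, convert logits to probabilities explicitly.
logits = model(tf.random.normal((4, 3)))
probs = tf.nn.sigmoid(logits)
```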
What is backpropagation, and why is it an important algorithm in neural network training?
Backpropagation is a key algorithm in neural network learning that efficiently computes the derivatives of the cost function with respect to all the model’s parameters. This is crucial because these derivatives are then used by optimisation algorithms like gradient descent or Adam to update the parameters, allowing the neural network to learn and minimise its cost function.
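Frameworks realise backpropagation through automatic differentiation; this sketch uses TensorFlow's GradientTape on a toy scalar cost standing in for J(w, b).

```python
# Sketch of automatic differentiation, the mechanism behind
# backpropagation in modern frameworks.
import tensorflow as tf

w = tf.Variable(3.0)
with tf.GradientTape() as tape:
    cost = w ** 2  # toy cost J(w) = w^2

grad = tape.gradient(cost, w)  # dJ/dw = 2w = 6.0
print(float(grad))
```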
What is Activation (a)?
The output value of an artificial neuron or a layer of neurons, often representing a probability or a transformed input
It is analogous to how much a biological neuron is ‘firing.’
Define Activation Function (G).
A non-linear function applied to the output of each neuron’s linear combination of inputs and weights
Examples include the sigmoid function.
What is an Artificial Neural Network (ANN)?
A computational model inspired by the structure and function of biological neural networks, designed to recognise patterns and relationships in data.
What is the role of an Axon?
The output wire of a biological neuron that transmits electrical impulses to other neurons.
What is Backward Propagation (Backpropagation)?
An algorithm used for training neural networks, which propagates errors backwards through the network to adjust weights and biases.
Define Bias (b) in the context of neural networks.
A parameter in a neuron that acts as an offset, shifting the activation function’s output.
What is the Cell Body (Nucleus) of a neuron?
The main part of a biological neuron where computations occur.
What is a Column Vector?
A matrix with a single column (e.g., Nx1 matrix).
What is Deep Learning?
A subfield of machine learning that uses neural networks with multiple layers to learn from data.