How does a transformer compute self-attention using Q, K, and V, and why is masking needed in the decoder during training?
Self-attention lets each token attend to every other token in the sequence to build context-aware representations.
Formula:
Attention(Q, K, V) = softmax((QKᵀ / √dₖ) + M) × V
Step-by-step:
1. Project each token's embedding into queries Q, keys K, and values V using learned weight matrices.
2. Compute similarity scores QKᵀ and scale by √dₖ so the softmax does not saturate for large dₖ.
3. Add the mask M, apply softmax row-wise to get attention weights, and take the weighted sum of V.
Why masking is needed:
During training the decoder sees the whole target sequence at once (teacher forcing). The causal mask M sets the scores of future positions to −∞, so after softmax each position attends only to itself and earlier tokens — matching the autoregressive setting at inference time, where future tokens do not exist yet.
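The formula above can be sketched in NumPy, with a causal mask that places −∞ above the diagonal (sequence length and dimensions here are arbitrary):

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k) + M) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) similarity scores
    if mask is not None:
        scores = scores + mask                 # -inf blocks future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)                   # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
mask = np.triu(np.full((T, T), -np.inf), k=1)  # -inf strictly above the diagonal
out, w = attention(Q, K, V, mask)
print(out.shape)                               # (4, 8)
print(np.allclose(np.triu(w, k=1), 0.0))       # True: no weight on future tokens
```

Each row of `w` sums to 1, and every weight above the diagonal is exactly zero, so token t only mixes values from positions ≤ t.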
(True/False): Keras is a low-level API for building deep learning models.
False — Keras is a high-level API; it runs on top of low-level libraries like TensorFlow.
Q1: What is a shallow neural network?
A: A network with only one or two hidden layers between the input and output layers. It takes a vector as input.
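As a sketch of the definition above — one hidden layer between input and output, a vector as input — a forward pass in plain NumPy (all layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

x = rng.normal(size=4)                         # input vector (4 features, arbitrary)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # input -> hidden (one hidden layer)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # hidden -> output

h = relu(W1 @ x + b1)                          # hidden representation
y = W2 @ h + b2                                # output logits (3 classes, arbitrary)
print(y.shape)                                 # (3,)
```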
Q1: What is the vanishing gradient problem in neural networks?
When we train a deep neural network, we update weights using backpropagation, which applies the chain rule through multiple layers.
If the activation function has a small derivative (like sigmoid or tanh), the product of many small derivatives makes the gradient shrink exponentially as it flows backward through layers → the vanishing gradient problem.
| Activation | Output Range | Derivative Range | Gradient Behavior | Comments |
| ---------- | ------------ | ---------------- | ----------------- | -------- |
| Sigmoid    | (0, 1)       | (0, 0.25]        | Vanishes quickly  | Smooth, but bad for deep nets |
| Tanh       | (-1, 1)      | (0, 1]           | Vanishes moderately | Better than sigmoid |
| ReLU       | [0, ∞)       | {0, 1}           | Stable            | Simple, efficient |
| Leaky ReLU | (-∞, ∞)      | {α, 1}           | Stable, nonzero   | Fixes dead ReLU |
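The "vanishes quickly" entry can be seen numerically: even in the best case, where every sigmoid derivative is at its maximum of 0.25, the product over many layers collapses toward zero. A toy calculation that ignores the weight terms:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    return s * (1.0 - s)            # its derivative; maximum 0.25 at z = 0

depth = 20
grad_product = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 20
print(grad_product)                        # ~9.09e-13: effectively zero after 20 layers

relu_product = 1.0 ** depth                # ReLU derivative is 1 for z > 0
print(relu_product)                        # 1.0: gradient magnitude preserved
```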
Q4: What is the maximum derivative value of the sigmoid activation?
A4: 0.25, which occurs when the input z equals 0.
Q7: What is the derivative of ReLU?
A7: f’(z) = 1 if z > 0, otherwise 0.
Q9: What is the main drawback of ReLU despite its advantages?
A9: The “dead ReLU” problem — when neurons output zero for all inputs (z ≤ 0) and stop updating because their gradient is always zero.
Q10: What activation functions were introduced to fix the “dead ReLU” problem?
A10: Leaky ReLU, ELU, SELU, and GELU. They keep small gradients even for negative inputs.
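A minimal NumPy sketch of the fix: unlike ReLU, Leaky ReLU keeps a small nonzero gradient for negative inputs, so the neuron can still update (α = 0.01 is a common but arbitrary choice here):

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)               # 0 for all z <= 0: "dead" region

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)         # small slope alpha for z <= 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(z))        # [0. 0. 1. 1.] — negative inputs get zero gradient
print(leaky_relu_grad(z))  # [0.01 0.01 1. 1.] — gradient never fully dies
```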
Q9 (Short Answer): What is the main purpose of an RNN?
A9: To process sequential data by maintaining a hidden state that stores information from previous time steps, allowing temporal dependencies to be learned.
The hidden state in an RNN at time t, denoted hₜ, is computed from the current input xₜ and the previous hidden state _______.
hₜ₋₁ (the previous hidden state).
What do the RNN weight matrices Wxh, Whh, and Why represent?
Wxh: weights connecting input xₜ to hidden state hₜ.
Whh: weights connecting previous hidden state hₜ₋₁ to current hidden state hₜ.
Why: weights connecting hidden state hₜ to output yₜ.
Write the general RNN update equations.
hₜ = f(Wxh * xₜ + Whh * hₜ₋₁ + bₕ)
yₜ = Why * hₜ + b_y
where f is an activation function such as tanh or ReLU.
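The update equations above, sketched in NumPy with tanh as the activation (all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2            # arbitrary sizes

Wxh = rng.normal(size=(d_h, d_in))    # input x_t -> hidden h_t
Whh = rng.normal(size=(d_h, d_h))     # hidden h_{t-1} -> hidden h_t (shared across time)
Why = rng.normal(size=(d_out, d_h))   # hidden h_t -> output y_t
bh, by = np.zeros(d_h), np.zeros(d_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)  # h_t = f(Wxh x_t + Whh h_{t-1} + b_h)
    y_t = Why @ h_t + by                          # y_t = Why h_t + b_y
    return h_t, y_t

# run over a short sequence, carrying the hidden state forward
h = np.zeros(d_h)
for x_t in rng.normal(size=(4, d_in)):
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)  # (5,) (2,)
```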
Which of the following tasks best fits an RNN?
A. Image classification
B. Sentiment analysis of a sentence
C. Sorting numbers
D. Predicting random noise
B. Sentiment analysis of a sentence — it involves sequential word dependencies.
Why is the vanishing gradient problem severe in RNNs?
Because backpropagation through time reuses the same weight matrix (Whh) at every step, the gradient is repeatedly multiplied by Whhᵀ and by activation derivatives; when these factors are small, the product shrinks exponentially across time.
Consequence of vanishing gradient in RNN
Early time steps (e.g., the beginning of a sentence) stop influencing model updates.
The model can only learn short-term dependencies because the gradient from long-term context becomes too small.
The network “forgets” older information during training.
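The repeated multiplication by Whh can be demonstrated directly. For the demo, Whh is scaled so its largest singular value is 0.9 (an assumption; tanh derivatives ≤ 1 would shrink the gradient further), and the backpropagated gradient norm decays over 30 time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 5, 30
Whh = rng.normal(size=(d_h, d_h))
Whh = 0.9 * Whh / np.linalg.svd(Whh, compute_uv=False)[0]  # largest singular value -> 0.9

# BPTT multiplies the gradient by Whh^T once per time step
g = np.ones(d_h)
for _ in range(T):
    g = Whh.T @ g
print(np.linalg.norm(g))  # at most sqrt(5) * 0.9**30 ~ 0.095: long-range signal nearly gone
```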
The main building block of a CNN is the _______ layer, which applies filters to extract features.
convolution (Conv2D)
How is the number of output channels in a Conv2D layer determined?
By the number of filters specified in the layer, not by the number of input channels.
How do Conv2D layer dimensions change?
Height and width are computed using the formula:
Valid padding: H_out = H_in - filter_height + 1, W_out = W_in - filter_width + 1
Channels = number of filters
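The formula above as a small helper (the 28×28 input, 3×3 filter, and 32-filter numbers are purely illustrative):

```python
def conv2d_output_shape(h_in, w_in, filter_h, filter_w, n_filters):
    # "valid" padding, stride 1: the filter fits (H - fh + 1) x (W - fw + 1) positions;
    # the channel count equals the number of filters, regardless of input channels
    return (h_in - filter_h + 1, w_in - filter_w + 1, n_filters)

print(conv2d_output_shape(28, 28, 3, 3, 32))  # (26, 26, 32)
```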
(True/False): Increasing the number of filters always improves CNN performance.
False — more filters add capacity and computation, and can cause overfitting or diminishing returns, so "always" does not hold.
What is the purpose of MaxPooling2D in CNN?
To reduce spatial dimensions, summarize features, and improve translation invariance.
(True/False): Pooling layers change the number of channels.
False — they only reduce spatial dimensions.
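Both answers can be checked with a minimal NumPy max-pooling sketch (non-overlapping 2×2 windows, stride 2): spatial dimensions halve while the channel count is untouched.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max-pool an (H, W, C) array with non-overlapping size x size windows."""
    h, w, c = x.shape
    x = x[:h - h % size, :w - w % size]            # drop edge rows/cols that don't fit
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    return x.reshape(h2, size, w2, size, c).max(axis=(1, 3))

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
y = max_pool2d(x)
print(x.shape, '->', y.shape)  # (4, 4, 3) -> (2, 2, 3): spatial halved, channels unchanged
```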
(Short Answer): Why do we use a Flatten layer before Dense layers?
A: To convert 2D or 3D feature maps into a 1D vector suitable for fully connected layers.
(Multiple Choice): What is the role of the final Dense layer with softmax activation?
A. Extract low-level features
B. Perform classification by outputting probabilities for each class
C. Reduce spatial dimensions
D. Detect edges
B. Perform classification by outputting probabilities for each class
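A minimal sketch of that softmax step, turning a final Dense layer's outputs into class probabilities (the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw outputs of a final Dense layer
probs = softmax(logits)
print(probs.sum())                  # 1.0: a valid probability distribution
print(probs.argmax())               # 0: the predicted class is the largest logit
```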
(True/False): Later layers in CNNs always need more filters than earlier layers.
False — sometimes fewer filters are used in later layers to reduce computation while capturing abstract features.