How does a transformer compute self-attention using Q, K, and V, and why is masking needed in the decoder during training?
Self-attention lets each token attend to every other token in the sequence to build context-aware representations.
Formula:
Attention(Q, K, V) = softmax((QKᵀ / √dₖ) + M) × V
Step-by-step:
1. Project each token's embedding into queries Q, keys K, and values V using learned weight matrices.
2. Compute similarity scores QKᵀ and scale by √dₖ so the softmax does not saturate for large dₖ.
3. Add the mask M, apply softmax row-wise to get attention weights, and take the weighted sum of V.
Why masking is needed:
During training the decoder sees the whole target sequence at once (teacher forcing). The causal mask M sets the scores of future positions to −∞, so after softmax each position attends only to itself and earlier tokens — matching the autoregressive setting at inference time, where future tokens do not exist yet.
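The formula above can be sketched in NumPy, with a causal mask that places −∞ above the diagonal (sequence length and dimensions here are arbitrary):

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k) + M) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) similarity scores
    if mask is not None:
        scores = scores + mask                 # -inf blocks future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)                   # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

T, d = 4, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
mask = np.triu(np.full((T, T), -np.inf), k=1)  # -inf strictly above the diagonal
out, w = attention(Q, K, V, mask)
print(out.shape)                               # (4, 8)
print(np.allclose(np.triu(w, k=1), 0.0))       # True: no weight on future tokens
```

Each row of `w` sums to 1, and every weight above the diagonal is exactly zero, so token t only mixes values from positions ≤ t.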
(True/False): Keras is a low-level API for building deep learning models.
False — Keras is a high-level API; it runs on top of low-level libraries like TensorFlow.
Q1: What is a shallow neural network?
A: A network with only one or two hidden layers between the input and output layers. It takes a vector as input.
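As a sketch of the definition above — one hidden layer between input and output, a vector as input — a forward pass in plain NumPy (all layer sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

x = rng.normal(size=4)                         # input vector (4 features, arbitrary)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # input -> hidden (one hidden layer)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # hidden -> output

h = relu(W1 @ x + b1)                          # hidden representation
y = W2 @ h + b2                                # output logits (3 classes, arbitrary)
print(y.shape)                                 # (3,)
```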
Q1: What is the vanishing gradient problem in neural networks?
When we train a deep neural network, we update weights using backpropagation, which applies the chain rule through multiple layers.
If the activation function has a small derivative (like sigmoid or tanh), the product of many small derivatives makes the gradient shrink exponentially as it flows backward through layers → the vanishing gradient problem.
| Activation | Output Range | Derivative Range | Gradient Behavior | Comments |
| ---------- | ------------ | ---------------- | ----------------- | -------- |
| Sigmoid    | (0, 1)       | (0, 0.25]        | Vanishes quickly  | Smooth, but bad for deep nets |
| Tanh       | (-1, 1)      | (0, 1]           | Vanishes moderately | Better than sigmoid |
| ReLU       | [0, ∞)       | {0, 1}           | Stable            | Simple, efficient |
| Leaky ReLU | (-∞, ∞)      | {α, 1}           | Stable, nonzero   | Fixes dead ReLU |
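The "vanishes quickly" entry can be seen numerically: even in the best case, where every sigmoid derivative is at its maximum of 0.25, the product over many layers collapses toward zero. A toy calculation that ignores the weight terms:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    return s * (1.0 - s)            # its derivative; maximum 0.25 at z = 0

depth = 20
grad_product = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 20
print(grad_product)                        # ~9.09e-13: effectively zero after 20 layers

relu_product = 1.0 ** depth                # ReLU derivative is 1 for z > 0
print(relu_product)                        # 1.0: gradient magnitude preserved
```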
Q4: What is the maximum derivative value of the sigmoid activation?
A4: 0.25, which occurs when the input z equals 0.
Q7: What is the derivative of ReLU?
A7: f’(z) = 1 if z > 0, otherwise 0.
Q9: What is the main drawback of ReLU despite its advantages?
A9: The “dead ReLU” problem — when neurons output zero for all inputs (z ≤ 0) and stop updating because their gradient is always zero.
Q10: What activation functions were introduced to fix the “dead ReLU” problem?
A10: Leaky ReLU, ELU, SELU, and GELU. They keep small gradients even for negative inputs.
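A minimal NumPy sketch of the fix: unlike ReLU, Leaky ReLU keeps a small nonzero gradient for negative inputs, so the neuron can still update (α = 0.01 is a common but arbitrary choice here):

```python
import numpy as np

def relu_grad(z):
    return (z > 0).astype(float)               # 0 for all z <= 0: "dead" region

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)         # small slope alpha for z <= 0

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(z))        # [0. 0. 1. 1.] — negative inputs get zero gradient
print(leaky_relu_grad(z))  # [0.01 0.01 1. 1.] — gradient never fully dies
```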
Q9 (Short Answer): What is the main purpose of an RNN?
A9: To process sequential data by maintaining a hidden state that stores information from previous time steps, allowing temporal dependencies to be learned.
The hidden state in an RNN at time t, denoted hₜ, is computed from the current input xₜ and the previous hidden state _______.
hₜ₋₁ (the previous hidden state).
What do the RNN weight matrices Wxh, Whh, and Why represent?
Wxh: weights connecting input xₜ to hidden state hₜ.
Whh: weights connecting previous hidden state hₜ₋₁ to current hidden state hₜ.
Why: weights connecting hidden state hₜ to output yₜ.
Write the general RNN update equations.
hₜ = f(Wxh * xₜ + Whh * hₜ₋₁ + bₕ)
yₜ = Why * hₜ + b_y
where f is an activation function such as tanh or ReLU.
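The update equations above, sketched in NumPy with tanh as the activation (all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2            # arbitrary sizes

Wxh = rng.normal(size=(d_h, d_in))    # input x_t -> hidden h_t
Whh = rng.normal(size=(d_h, d_h))     # hidden h_{t-1} -> hidden h_t (shared across time)
Why = rng.normal(size=(d_out, d_h))   # hidden h_t -> output y_t
bh, by = np.zeros(d_h), np.zeros(d_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)  # h_t = f(Wxh x_t + Whh h_{t-1} + b_h)
    y_t = Why @ h_t + by                          # y_t = Why h_t + b_y
    return h_t, y_t

# run over a short sequence, carrying the hidden state forward
h = np.zeros(d_h)
for x_t in rng.normal(size=(4, d_in)):
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)  # (5,) (2,)
```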
Which of the following tasks best fits an RNN?
A. Image classification
B. Sentiment analysis of a sentence
C. Sorting numbers
D. Predicting random noise
B. Sentiment analysis of a sentence — it involves sequential word dependencies.
Why is the vanishing gradient problem severe in RNNs?
Because backpropagation through time reuses the same weight matrix (Whh) at every step, the gradient is repeatedly multiplied by Whhᵀ and by activation derivatives; when these factors are small, the product shrinks exponentially across time.
Consequence of vanishing gradient in RNN
Early time steps (e.g., the beginning of a sentence) stop influencing model updates.
The model can only learn short-term dependencies because the gradient from long-term context becomes too small.
The network “forgets” older information during training.
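The repeated multiplication by Whh can be demonstrated directly. For the demo, Whh is scaled so its largest singular value is 0.9 (an assumption; tanh derivatives ≤ 1 would shrink the gradient further), and the backpropagated gradient norm decays over 30 time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, T = 5, 30
Whh = rng.normal(size=(d_h, d_h))
Whh = 0.9 * Whh / np.linalg.svd(Whh, compute_uv=False)[0]  # largest singular value -> 0.9

# BPTT multiplies the gradient by Whh^T once per time step
g = np.ones(d_h)
for _ in range(T):
    g = Whh.T @ g
print(np.linalg.norm(g))  # at most sqrt(5) * 0.9**30 ~ 0.095: long-range signal nearly gone
```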
The main building block of a CNN is the _______ layer, which applies filters to extract features.
convolution (Conv2D)
How is the number of output channels in a Conv2D layer determined?
By the number of filters specified in the layer, not by the number of input channels.
How do Conv2D layer dimensions change?
Height and width are computed using the formula:
Valid padding: H_out = H_in - filter_height + 1, W_out = W_in - filter_width + 1
Channels = number of filters
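The formula above as a small helper (the 28×28 input, 3×3 filter, and 32-filter numbers are purely illustrative):

```python
def conv2d_output_shape(h_in, w_in, filter_h, filter_w, n_filters):
    # "valid" padding, stride 1: the filter fits (H - fh + 1) x (W - fw + 1) positions;
    # the channel count equals the number of filters, regardless of input channels
    return (h_in - filter_h + 1, w_in - filter_w + 1, n_filters)

print(conv2d_output_shape(28, 28, 3, 3, 32))  # (26, 26, 32)
```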
(True/False): Increasing the number of filters always improves CNN performance.
False — more filters add capacity and computation, and can cause overfitting or diminishing returns, so "always" does not hold.
What is the purpose of MaxPooling2D in CNN?
To reduce spatial dimensions, summarize features, and improve translation invariance.
(True/False): Pooling layers change the number of channels.
False — they only reduce spatial dimensions.
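Both answers can be checked with a minimal NumPy max-pooling sketch (non-overlapping 2×2 windows, stride 2): spatial dimensions halve while the channel count is untouched.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Max-pool an (H, W, C) array with non-overlapping size x size windows."""
    h, w, c = x.shape
    x = x[:h - h % size, :w - w % size]            # drop edge rows/cols that don't fit
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    return x.reshape(h2, size, w2, size, c).max(axis=(1, 3))

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
y = max_pool2d(x)
print(x.shape, '->', y.shape)  # (4, 4, 3) -> (2, 2, 3): spatial halved, channels unchanged
```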
(Short Answer): Why do we use a Flatten layer before Dense layers?
A: To convert 2D or 3D feature maps into a 1D vector suitable for fully connected layers.
(Multiple Choice): What is the role of the final Dense layer with softmax activation?
A. Extract low-level features
B. Perform classification by outputting probabilities for each class
C. Reduce spatial dimensions
D. Detect edges
B. Perform classification by outputting probabilities for each class
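A minimal sketch of that softmax step, turning a final Dense layer's outputs into class probabilities (the logit values are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw outputs of a final Dense layer
probs = softmax(logits)
print(probs.sum())                  # 1.0: a valid probability distribution
print(probs.argmax())               # 0: the predicted class is the largest logit
```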
(True/False): Later layers in CNNs always need more filters than earlier layers.
False — sometimes fewer filters are used in later layers to reduce computation while capturing abstract features.