Backpropagation: During backward pass, a layer needs the weight matrix W to compute what, and why?
W is needed to propagate the error signal to the layer below.
- δ_i = W_(i+1)^T · δ_(i+1) — the error signal from above, pushed back through the next layer's weights
- ∂L/∂W_i = δ_i · h_(i-1)^T — the gradient for the layer's own weights
Backpropagation: Write the full chain rule for ∂L/∂W₁ in a 3-layer network h₁=W₁x, h₂=W₂h₁, h₃=W₃h₂, L=loss(h₃).
∂L/∂W₁ = (upstream gradient δ₁) · x^T, where the upstream gradient unfolds via chain rule:
∂L/∂W₁ = (∂L/∂h₃) · (∂h₃/∂h₂) · (∂h₂/∂h₁) · (∂h₁/∂W₁)
- For squared-error loss, ∂L/∂h₃ = h₃ − y, giving ∂L/∂W₁ = W₂^T · W₃^T · (h₃ − y) · x^T
- Every layer contributes a factor of W_k^T — this is why vanishing/exploding gradients occur
- In practice: compute δ_i = W_(i+1)^T · δ_(i+1) recursively, then ∂L/∂W_i = δ_i · h_(i-1)^T
Communication Primitives: Do all-reduce, all-gather, and reduce-scatter use a central server? How are they actually implemented?
No central server. Modern collective ops (e.g., in NCCL) use peer-to-peer communication, classically the ring algorithm, with tree variants to cut latency at large scale.
- Ring all-reduce costs ~2D bytes per GPU, regardless of GPU count
Communication Primitives: What does each GPU start with and end with for all-reduce, reduce-scatter, and all-gather?
Decode each name — the words tell you the operation:
Remembering:
- All-Reduce: start with a full local array → end with the same reduced (summed) array on every GPU
- Reduce-Scatter: start with a full local array → end with one reduced (summed) shard per GPU
- All-Gather: start with one shard per GPU → end with all shards concatenated on every GPU (no sum)
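A toy, single-process sketch of these semantics — each "GPU" is just a Python list, no real communication library involved:

```python
# Toy model of the three collectives: each "GPU" is a plain list.
def all_reduce(per_gpu):
    # start: full array on each rank → end: elementwise sum on every rank
    total = [sum(vals) for vals in zip(*per_gpu)]
    return [list(total) for _ in per_gpu]

def reduce_scatter(per_gpu):
    # start: full array on each rank → end: rank r keeps only shard r of the sum
    n = len(per_gpu)
    total = [sum(vals) for vals in zip(*per_gpu)]
    shard = len(total) // n
    return [total[r * shard:(r + 1) * shard] for r in range(n)]

def all_gather(shards):
    # start: one shard per rank → end: all shards concatenated on every rank
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]

grads = [[1, 2, 3, 4], [10, 20, 30, 40]]    # 2 "GPUs"
print(all_reduce(grads))                    # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(grads))                # [[11, 22], [33, 44]]
print(all_gather(reduce_scatter(grads)))    # same result as all_reduce
```

Note that reduce-scatter followed by all-gather reproduces all-reduce — which is exactly how ring all-reduce is built internally.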
Communication Primitives: Which parallelism strategy uses reduce-scatter, and why is it the right choice?
FSDP/ZeRO-2+ uses reduce-scatter for gradient sync after backward.
- Sends ~D bytes per GPU (half of all-reduce's ~2D) and leaves each GPU holding exactly the summed gradient shard its optimizer shard needs → less communication than DDP
PyTorch: Where do gradients physically live? Does the optimizer know about the loss?
Gradients live on parameter tensors as param.grad, not on the loss or optimizer.
- model.layer1.weight.grad — a tensor with the same shape as the weight
- loss.backward() writes to param.grad (ADDS to existing values)
- optimizer.step() reads param.grad and updates param
- optimizer.zero_grad() resets param.grad to zero
- The optimizer is built as Adam(model.parameters()): it holds references to the parameters only — it never sees the loss
PyTorch: Why does gradient accumulation work without special code beyond delaying zero_grad()?
Because loss.backward() ADDS to param.grad by default rather than overwriting it.
- zero_grad(): W.grad = 0
- loss1.backward(): W.grad = g₁
- loss2.backward(): W.grad = g₁ + g₂ (accumulated!)
- All of this happens in the same param.grad tensor; optimizer.step() simply sees the sum
Large Batch Training: Why do very large batch sizes hurt generalization, even though the gradient is more accurate?
Three reasons:
- Less gradient noise: small-batch noise acts as implicit regularization; large batches tend to settle into sharp minima that generalize worse
- Fewer optimizer steps for the same number of epochs, so less exploration of the loss landscape
- The learning rate must be re-tuned (e.g., linear scaling plus warmup); without it, training under-performs
FSDP: How does FSDP save memory if every GPU needs full weights for forward/backward?
Layer-by-layer gather — never holds full model, only one layer at a time.
- Permanent weight memory: N/K per GPU (only its shards)
- Temporary: one layer's gathered weights, N_layer
- Peak: N/K + N_layer — NOT the full model
Example: 7B model in fp16, 32 layers, K=8 GPUs:
- DDP: 14 GB weight memory
- FSDP: 1.75 GB permanent + ~440 MB temp = ~2.2 GB
- Key: layers processed sequentially, never need all weights simultaneously
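The arithmetic above can be checked directly. A quick sketch, assuming fp16 (2 bytes per parameter) and equally sized layers:

```python
# Sanity-check the 7B / 32-layer / 8-GPU numbers (fp16 → 2 bytes per param).
params, layers, K = 7e9, 32, 8

ddp_gb = params * 2 / 1e9        # DDP: every GPU holds all weights
shard_gb = ddp_gb / K            # FSDP permanent: 1/K of the weights
layer_gb = ddp_gb / layers       # FSDP transient: one gathered layer
peak_gb = shard_gb + layer_gb

print(f"DDP:  {ddp_gb:.2f} GB")    # 14.00 GB
print(f"FSDP: {shard_gb:.2f} GB + {layer_gb * 1000:.0f} MB ≈ {peak_gb:.2f} GB")
```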
FSDP: Why must FSDP all-gather weights TWICE — once in forward, once in backward for the same layer?
Forward and backward for the same layer are separated in time.
- After a layer's forward pass, the gathered full weights are freed to keep memory low, so backward must all-gather them again
- Peak memory stays at N/K + N_layer instead of the full model
FSDP Communication: How does FSDP’s communication compare to DDP and ZeRO-1/2?
ZeRO-2 < DDP < ZeRO-3/FSDP in total volume.
- DDP (all-reduce gradients): ~2N bytes/GPU
- ZeRO-1/2 (reduce-scatter gradients): ~N bytes (half of DDP!)
- ZeRO-3/FSDP (all-gather weights twice + reduce-scatter gradients): ~3N bytes (~1.5× DDP)
Terminology: What is the difference between ‘saved activation’ and ‘activation function’?
Completely different — don’t mix them up in interviews.
- Activation function: the nonlinearity itself (ReLU, GELU) — costs almost nothing to store
- Saved activation: the intermediate tensor of shape [b, s, d] kept for backward — this is what dominates memory, not the function
Mixed Precision: fp16 training produces NaN. Overflow or underflow? How to fix?
Overflow — gradients exceeding fp16 max of 65,504.
- Large gradient values become inf in fp16
- inf - inf or 0 × inf → NaN → propagates everywhere
- Fix: loss scaling (multiply the loss up before backward, unscale gradients before the optimizer step), or switch to bf16
Mixed Precision: Compare fp16 and bf16 bit layouts. Why doesn’t bf16 need loss scaling?
bf16 has fp32’s range but fp16’s size — immune to overflow.
- fp16: 1 sign + 5 exponent + 10 mantissa bits → max ≈ 65,504
- bf16: 1 sign + 8 exponent + 7 mantissa bits → same range as fp32 (max ≈ 3.4×10³⁸), less precision
- Same dynamic range as fp32 means gradients cannot overflow the format, so no loss scaling is needed
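A quick NumPy demonstration of the fp16 failure mode. NumPy has no native bf16 type, so the bf16 range claim is shown via fp32, whose exponent layout bf16 shares:

```python
import numpy as np

# fp16: 5 exponent bits → tiny dynamic range
print(np.finfo(np.float16).max)        # 65504
# bf16 reuses fp32's 8 exponent bits, so it shares fp32's range:
print(np.finfo(np.float32).max)        # ~3.4e38

with np.errstate(over="ignore", invalid="ignore"):
    g = np.float16(60000) + np.float16(60000)   # exceeds 65504 → overflows
    print(g)        # inf
    print(g - g)    # inf - inf → nan, which then poisons every downstream value
```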
System Design: Before proposing parallelism, what 3 baseline techniques should you always mention?
Mixed precision, activation checkpointing, gradient accumulation.
- Mixed precision (bf16/fp16): halves weight and activation memory, speeds up compute
- Activation checkpointing: O(L) → O(√L) activation memory, ~33% compute cost
- Gradient accumulation: large effective batch without extra memory
Memory Wall: Both FSDP and TP solve ‘model doesn’t fit.’ How do they differ?
FSDP shards model state (weights, gradients, optimizer state), but each GPU still executes the full computation, gathering weights as needed. TP splits the computation itself: each matmul is divided across GPUs, which is why it needs a fast interconnect within every layer.
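A minimal NumPy sketch of the TP idea — a column-split linear layer where two hypothetical "GPUs" each compute a slice of the output (FSDP, by contrast, would all-gather the full W onto each GPU and run the unsplit matmul locally):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))          # [batch, d_in]
W = rng.standard_normal((8, 16))         # full weight matrix

# TP: split W by output columns; each "GPU" computes only its slice of y
W0, W1 = np.split(W, 2, axis=1)
y_tp = np.concatenate([x @ W0, x @ W1], axis=1)   # concat = all-gather on features

assert np.allclose(y_tp, x @ W)          # same math as the unsharded layer
```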
Scaling: Name the 4 bottleneck categories when adding GPUs doesn’t speed things up.
Communication, bubbles, GPU underutilization, data I/O.
- Bubbles: (P-1)/(M+P-1) of pipeline time is wasted. More stages → worse. Fix: more micro-batches.
Scaling: State Amdahl’s Law and its implication for distributed training.
Speedup = 1 / (s + (1-s)/N), where s = serial fraction, N = processors.
- Implication: speedup is capped at 1/s no matter how many GPUs you add; any serial work (communication setup, data loading, logging) bounds scaling
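A few plugged-in numbers make the cap concrete (s = 0.05 here is a hypothetical serial fraction):

```python
def amdahl_speedup(s, n):
    """Speedup with serial fraction s on n processors."""
    return 1.0 / (s + (1.0 - s) / n)

for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 2))
# With 5% serial work, speedup can never exceed 1/0.05 = 20×:
print(round(amdahl_speedup(0.05, 10**9), 2))    # 20.0
```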
Interview Technique: How should you structure a ‘compare X, Y, Z’ answer?
Open with axes, then fill in each briefly. Table format.
Gradient Accumulation: Can it reduce effective batch size? What’s the minimum?
No — accumulation only increases batch, never decreases.
- Effective batch = local_batch × GPUs × accum_steps
- Minimum: accum=1, local_batch=1 → effective batch = num_GPUs
DDP: How does PyTorch DDP overlap communication with backward computation?
It starts all-reducing gradients of already-finished (later) layers while earlier layers are still computing, using gradient buckets fired by autograd hooks.
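A toy timeline model of why the overlap helps — all durations are made-up units, not measurements (real DDP buckets gradients, ~25 MB per bucket by default, and runs NCCL on a separate stream):

```python
# Toy timeline: 4 gradient buckets, backward compute vs. all-reduce time.
compute = [4.0, 4.0, 4.0, 4.0]   # backward time per bucket (made-up units)
comm = [3.0, 3.0, 3.0, 3.0]      # all-reduce time per bucket

serial = sum(compute) + sum(comm)    # no overlap: comm waits for full backward

t = 0.0          # backward progress
link = 0.0       # when the communication link frees up
for c, m in zip(compute, comm):
    t += c                       # this bucket's gradients are now ready
    link = max(link, t) + m      # its all-reduce runs while backward continues
overlapped = max(t, link)

print(serial, overlapped)        # 28.0 19.0 — most comm hidden behind compute
```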
Pipeline Parallelism: What exactly flows between GPUs, and in what pattern?
Activation tensors (forward) and activation gradients (backward), point-to-point between adjacent stages.
- Forward: each stage produces activations [b,s,d] and sends them to the next stage (GPU0 → GPU1)
- Backward: each stage sends ∂L/∂activation back to the previous stage (e.g., GPU3 → GPU2)
Pipeline Parallelism: 1F1B vs GPipe — same bubble, different what?
Same bubble fraction, drastically different activation memory.
- Both have bubble fraction (P-1)/(M+P-1)
- GPipe: all M forwards, then all M backwards → must hold activations for all M micro-batches
- 1F1B: alternates one forward with one backward → holds activations for at most P micro-batches
All-Reduce Cost: What is the total communication cost of ring all-reduce?
~2D bytes per GPU, regardless of GPU count.
- Exactly 2 × D × (N-1)/N bytes per GPU → approaches 2D as N grows
- Breakdown: the reduce-scatter phase sends (N-1) chunks of D/N bytes, the all-gather phase another (N-1) chunks of D/N
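A pure-Python simulation of the ring — ranks are lists and "sending" is copying — that confirms both the final result and the per-GPU volume:

```python
# Each "rank" is a list; we count elements sent to verify 2 * D * (N-1)/N.
def ring_all_reduce(data):
    n, d = len(data), len(data[0])
    chunk = d // n                            # assumes d divisible by n
    span = lambda c: range(c * chunk, (c + 1) * chunk)
    bufs = [list(row) for row in data]
    sent = [0] * n                            # elements sent per rank

    for step in range(n - 1):                 # phase 1: reduce-scatter
        msgs = []
        for r in range(n):                    # snapshot all sends for this step
            c = (r - step) % n
            msgs.append(((r + 1) % n, c, [bufs[r][i] for i in span(c)]))
            sent[r] += chunk
        for dst, c, vals in msgs:             # receivers ADD the incoming chunk
            for i, v in zip(span(c), vals):
                bufs[dst][i] += v

    for step in range(n - 1):                 # phase 2: all-gather
        msgs = []
        for r in range(n):
            c = (r + 1 - step) % n            # pass along the finished chunk
            msgs.append(((r + 1) % n, c, [bufs[r][i] for i in span(c)]))
            sent[r] += chunk
        for dst, c, vals in msgs:             # receivers OVERWRITE with it
            for i, v in zip(span(c), vals):
                bufs[dst][i] = v
    return bufs, sent

data = [[float(r * 8 + i) for i in range(8)] for r in range(4)]   # 4 ranks, D=8
out, sent = ring_all_reduce(data)
expected = [sum(col) for col in zip(*data)]
assert all(row == expected for row in out)    # every rank has the full sum
print(sent)                                   # each rank sent 2*(N-1)*D/N = 12 elements
```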