Mire_dis_train Flashcards

(89 cards)

1
Q

Backpropagation: During backward pass, a layer needs the weight matrix W to compute what, and why?

A

W is needed to propagate the error signal to the layer below.

  • Formula: δ_i = δ_(i+1) · W_(i+1)^T
  • This is “Question B” of backprop — passing the gradient downstream
  • Without W, the chain rule breaks: layer i-1 would have no error signal
  • The saved activation h_(i-1) is needed for a different purpose: computing ∂L/∂W_i = δ_i · h_(i-1)^T (“Question A”)
  • Interview insight: “Backward needs weights for gradient propagation, activations for weight updates”
2
Q

Backpropagation: Write the full chain rule for ∂L/∂W₁ in a 3-layer network h₁=W₁x, h₂=W₂h₁, h₃=W₃h₂, L=loss(h₃).

A

∂L/∂W₁ = (upstream gradient δ₁) · x^T, where the upstream gradient unfolds via chain rule:

  • ∂L/∂W₁ = (∂L/∂h₃) · (∂h₃/∂h₂) · (∂h₂/∂h₁) · (∂h₁/∂W₁)
  • Substituting (for squared-error loss, so ∂L/∂h₃ = h₃ − y): (h₃ − y) · W₃^T · W₂^T · x^T
  • Each “pass-through” factor is W_k^T — this is why vanishing/exploding gradients occur
  • Backprop shortcut: define δ_i = δ_(i+1) · W_(i+1)^T recursively, then ∂L/∂W_i = δ_i · h_(i-1)^T
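The δ recursion above can be sanity-checked numerically. A minimal numpy sketch, assuming a squared-error loss (so ∂L/∂h₃ = h₃ − y) and column-vector convention (the error propagates as W^T · δ):

```python
import numpy as np

# Tiny check of the delta recursion for h1=W1x, h2=W2h1, h3=W3h2,
# L = 0.5*||h3 - y||^2 (squared-error assumed, as in the card).
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
W1, W2, W3 = (rng.normal(size=(3, 3)) for _ in range(3))

h1 = W1 @ x
h2 = W2 @ h1
h3 = W3 @ h2

delta3 = h3 - y                # dL/dh3
delta2 = W3.T @ delta3         # dL/dh2: pass error through W3^T
delta1 = W2.T @ delta2         # dL/dh1: pass error through W2^T
grad_W1 = np.outer(delta1, x)  # dL/dW1 = delta1 · x^T (Question A)

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
L0 = 0.5 * np.sum((W3 @ (W2 @ (W1 @ x)) - y) ** 2)
Lp = 0.5 * np.sum((W3 @ (W2 @ (W1p @ x)) - y) ** 2)
print(abs((Lp - L0) / eps - grad_W1[0, 1]) < 1e-3)  # True
```

The finite-difference probe confirms that propagating δ through the transposed weights gives the same ∂L/∂W₁ as perturbing the weight directly.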
3
Q

Communication Primitives: Do all-reduce, all-gather, and reduce-scatter use a central server? How are they actually implemented?

A

No central server. Modern collective ops use peer-to-peer communication, most commonly via the ring algorithm (with tree variants at large scale).

  • GPUs in a logical ring: each sends right, receives left
  • Ring all-reduce transfers ~2D bytes per GPU, regardless of GPU count
  • Parameter server architecture (~2014) is dead — creates bandwidth bottleneck
  • Key fact: All-reduce = reduce-scatter + all-gather (two phases of ring algorithm)
4
Q

Communication Primitives: What does each GPU start with and end with for all-reduce, reduce-scatter, and all-gather?

A

Decode each name — the words tell you the operation:

  • All-Reduce: full tensor → same full sum on all GPUs
  • Reduce-Scatter: full tensor → different shard of sum per GPU
  • All-Gather: shard → same full concatenation on all GPUs

Remembering:
- All-Reduce: same result everywhere, reduced (summed)
- Reduce-Scatter: scattered pieces, reduced (summed)
- All-Gather: same result everywhere, just gathered (no sum)

5
Q

Communication Primitives: Which parallelism strategy uses reduce-scatter, and why is it the right choice?

A

FSDP/ZeRO-2+ uses reduce-scatter for gradient sync after backward.

  • Each GPU only owns a shard of params/grads/optimizer
  • After backward, each GPU needs only its shard of summed gradient
  • Reduce-scatter delivers exactly that — no wasted memory
  • DDP’s all-reduce gives full gradient to everyone → wastes (K-1)/K memory
  • Reduce-scatter transfers ~D bytes (half of all-reduce’s ~2D) → less communication than DDP
6
Q

PyTorch: Where do gradients physically live? Does the optimizer know about the loss?

A

Gradients live on parameter tensors as param.grad, not on the loss or optimizer.

  • model.layer1.weight.grad — same shape as weight
  • loss.backward() writes to param.grad (ADDS to existing)
  • optimizer.step() reads param.grad, updates param
  • optimizer.zero_grad() resets param.grad to zero
  • Optimizer never sees the loss. Constructed with Adam(model.parameters())
  • Loss is ephemeral — new tensor each forward pass
7
Q

PyTorch: Why does gradient accumulation work without special code beyond delaying zero_grad()?

A

Because loss.backward() ADDS to param.grad by default, not overwrite.

  • After zero_grad(): W.grad = 0
  • After loss1.backward(): W.grad = g₁
  • After loss2.backward(): W.grad = g₁ + g₂ (accumulated!)
  • Must divide loss by accumulation_steps to get mean not sum
  • Each backward uses different computation graph but same param.grad tensor
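The accumulation identity can be demonstrated without torch. A small numpy sketch, simulating how adding per-micro-batch gradients of (loss / K) reproduces the full-batch mean gradient (equal micro-batch sizes assumed):

```python
import numpy as np

# Model: scalar w, loss = mean((w*x - y)^2). backward() ADDS into param.grad,
# so summing micro-batch gradients of (loss / K) equals the full-batch gradient.
rng = np.random.default_rng(1)
x, y = rng.normal(size=8), rng.normal(size=8)
w = 0.5

def grad(xs, ys):
    # d/dw of mean((w*x - y)^2)
    return np.mean(2 * (w * xs - ys) * xs)

full = grad(x, y)                  # one big batch of 8

K = 4
acc = 0.0                          # param.grad right after zero_grad()
for xs, ys in zip(np.split(x, K), np.split(y, K)):
    acc += grad(xs, ys) / K        # loss / K, then backward() accumulates
print(np.isclose(acc, full))  # True
```

Dropping the `/ K` would leave `acc` K× too large, which is exactly the "effective LR too high" bug the deck warns about.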
8
Q

Large Batch Training: Why do very large batch sizes hurt generalization, even though the gradient is more accurate?

A

Three reasons:

  1. Loss of implicit regularization: SGD noise bounces you out of sharp minima into flat ones (better generalization). Large batches → less noise → settle in sharp minima (Keskar et al., 2017)
  2. Fewer updates per epoch: batch=32 gives N/32 course corrections; batch=8192 gives N/8192
  3. Linear scaling rule breaks: double batch → double LR works up to ~8K-32K, then overshooting
  • Mnemonic: “SGD noise is a free regularizer — rejects sharp minima”
  • Fixes: LR warmup, LARS/LAMB optimizers, limit max batch size
9
Q

FSDP: How does FSDP save memory if every GPU needs full weights for forward/backward?

A

Layer-by-layer gather — never holds full model, only one layer at a time.

  • Permanent: N/K per GPU (only shards)
  • Before Layer 5: all-gather full weights → +N_layer temporary
  • After Layer 5: FREE gathered weights immediately
  • Peak: N/K + N_layer — NOT full model

7B model, 32 layers, K=8:
- DDP: 14 GB weight memory
- FSDP: 1.75 GB permanent + ~440 MB temp = ~2.2 GB
- Key: layers processed sequentially, never need all weights simultaneously
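The peak-memory arithmetic above is easy to check. A sketch using the card's own numbers (7B params, bf16 = 2 bytes/param, 32 equal layers, K=8):

```python
# Peak weight memory: DDP holds the full copy; FSDP holds its shard
# plus one temporarily gathered layer.
N_bytes = 7e9 * 2                             # full bf16 weights: 14 GB
K, layers = 8, 32
ddp_peak = N_bytes                            # full replica per GPU
fsdp_peak = N_bytes / K + N_bytes / layers    # shards + one gathered layer
print(round(ddp_peak / 1e9, 2))   # 14.0
print(round(fsdp_peak / 1e9, 2))  # 2.19
```

The ~440 MB temp term is one layer's weights (14 GB / 32); the permanent 1.75 GB is the shard (14 GB / 8).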

10
Q

FSDP: Why must FSDP all-gather weights TWICE — once in forward, once in backward for the same layer?

A

Forward and backward for the same layer are separated in time.

  • Forward: L1→L2→…→L32→loss
  • Backward: L32→L31→…→L1
  • Layer 5 forward is early, backward is late
  • Keeping weights from forward through backward = holding them the entire step → defeats FSDP
  • Free after forward, re-gather for backward → peak stays N/K + N_layer instead of full model
11
Q

FSDP Communication: How does FSDP’s communication compare to DDP and ZeRO-1/2?

A

ZeRO-2 < DDP < ZeRO-3/FSDP in total volume.

  • DDP: 1 all-reduce = ~2N bytes/GPU
  • ZeRO-2: reduce-scatter = ~N bytes (half of DDP!)
  • ZeRO-3/FSDP: 2× all-gather + 1× reduce-scatter per layer = ~3N bytes (~1.5× DDP)
  • FSDP overhead mostly hidden behind compute on NVLink
  • Common mistake: saying FSDP is “tiny” extra comm — it’s 1.5×, not negligible
12
Q

Terminology: What is the difference between ‘saved activation’ and ‘activation function’?

A

Completely different — don’t mix them up in interviews.

  • Activation function: nonlinearity like ReLU, GeLU — a math operation
  • Saved activation: the intermediate tensor (h₁, h₂) stored during forward for backward use
  • “Activations consume memory” = saved tensors [b, s, d], not the function
  • Activation checkpointing = which tensors to keep vs recompute
  • Wrong: “free the activation function” — Right: “free the saved activation”
13
Q

Mixed Precision: fp16 training produces NaN. Overflow or underflow? How to fix?

A

Overflow — gradients exceeding fp16 max of 65,504.

  • Gradient > 65504 → inf in fp16
  • inf - inf or 0 × inf → NaN → propagates everywhere
  • Underflow causes stalled training (loss plateaus), NOT NaN
  • Fixes: (1) dynamic loss scaling, (2) switch to bf16
  • One-liner: “NaN in fp16 → gradient overflow → loss scaling or bf16”
14
Q

Mixed Precision: Compare fp16 and bf16 bit layouts. Why doesn’t bf16 need loss scaling?

A

bf16 has fp32’s range but fp16’s size — immune to overflow.

  • fp16: 5 exponent, 10 mantissa, max ~65,504
  • bf16: 8 exponent, 7 mantissa, max ~3.4×10³⁸ (same as fp32)
  • bf16’s 8 exponent bits = same range as fp32 → no overflow → no loss scaling
  • bf16 has less precision than fp16 (7 vs 10 mantissa) — doesn’t matter for training
  • Mnemonic: “bf16 = fp32’s range in fp16’s size”
15
Q

System Design: Before proposing parallelism, what 3 baseline techniques should you always mention?

A

Mixed precision, activation checkpointing, gradient accumulation.

  1. bf16: ~2× faster, halves activation memory, trivial to enable
  2. Activation checkpointing: O(L) → O(√L) activation memory, ~33% extra compute
  3. Gradient accumulation: larger effective batch without more memory
  • Say these FIRST: “Before parallelism, I’d enable bf16, activation checkpointing, and gradient accumulation.”
  • Interview insight: omitting baselines signals you jump to complex solutions first
16
Q

Memory Wall: Both FSDP and TP solve ‘model doesn’t fit.’ How do they differ?

A

FSDP shards model state. TP splits computation.

  • FSDP: stores 1/K of everything, gathers per layer temporarily. Each GPU still does full layer computation.
  • TP: each GPU holds a slice of weight matrices, does partial matrix multiply. Reduces BOTH param AND activation memory.
  • FSDP fails when: single layer too big for temp gather, OR activations from long sequences too large
  • TP helps with both — but needs NVLink, adds 2 all-reduces/layer
  • Rule: FSDP for <10B, add TP for >10B or long sequences
17
Q

Scaling: Name the 4 bottleneck categories when adding GPUs doesn’t speed things up.

A

Communication, bubbles, GPU underutilization, data I/O.

  1. Communication: all-reduce/gather slower than compute. Slow interconnect or small model.
  2. Pipeline bubble: (P-1)/(M+P-1) wasted. More stages → worse. Fix: more micro-batches.
  3. GPU underutilization: per-GPU matrices too small for tensor cores (TP too high, batch too small).
  4. Data I/O: GPUs starved waiting for data. Fix: more workers, prefetch, NVMe.
  • Amdahl’s Law: 10% serial → max 10× speedup, infinite GPUs
  • Approach: enumerate categories first, then discuss each
18
Q

Scaling: State Amdahl’s Law and its implication for distributed training.

A

Speedup = 1 / (s + (1-s)/N) where s = serial fraction, N = processors.

  • 10% serial → max speedup = 10×, even with infinite GPUs
  • Serial portions: optimizer step, logging, data loading, pipeline bubbles, non-overlapped communication
  • 256 GPUs + 2% serial → speedup ≈ 42× (only ~16% efficiency); 50× is the cap even as N→∞
  • This is why pure DP has diminishing returns — serial overhead dominates
  • Mentioning Amdahl’s Law shows you understand fundamental limits
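Plugging the card's numbers into the formula, as a quick sketch:

```python
def amdahl(s, n):
    """Max speedup with serial fraction s on n processors: 1/(s + (1-s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

print(round(amdahl(0.02, 256), 1))   # 42.0 — 2% serial on 256 GPUs
print(round(amdahl(0.02, 10**9), 1)) # 50.0 — asymptotic cap is 1/s
print(round(amdahl(0.10, 10**9), 1)) # 10.0 — the card's 10%-serial example
```

Note the efficiency: 42×/256 GPUs ≈ 16%, which is why serial overhead dominates at scale.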
19
Q

Interview Technique: How should you structure a ‘compare X, Y, Z’ answer?

A

Open with axes, then fill in each briefly. Table format.

  • Bad: meander through concepts with long paragraphs
  • Good: “They differ on 3 axes: what’s split, communication, when to use”
  • Use a mental table: rows = strategies, columns = comparison axes
  • Example axes for DP/TP/PP: what’s split, communication cost, bandwidth need, when to use
  • Finish with: “In practice we combine them: TP inside, PP across, DP everywhere”
  • Structured answers score higher with identical content
20
Q

Gradient Accumulation: Can it reduce effective batch size? What’s the minimum?

A

No — accumulation only increases batch, never decreases.

  • Effective batch = local_batch × GPUs × accum_steps
  • Minimum: accum=1, local_batch=1 → effective = num_GPUs
  • 32 GPUs → min effective batch = 32
  • If optimal batch < 32, reduce DP degree (use some GPUs for TP instead)
  • Example: TP=4, DP=8 → min effective batch = 8
  • Rarely a problem: most models train well with batch 64-512
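The effective-batch formula above in a two-line sketch:

```python
def effective_batch(local_batch, dp_degree, accum_steps):
    # Effective batch = local batch × data-parallel degree × accumulation steps
    return local_batch * dp_degree * accum_steps

print(effective_batch(1, 32, 1))  # 32 — the floor with 32 data-parallel GPUs
print(effective_batch(1, 8, 1))   # 8  — 32 GPUs with TP=4 leaves DP=8
```

The only knob that lowers the floor is the DP degree, which is why the card suggests spending GPUs on TP instead.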
21
Q

DDP: How does PyTorch DDP overlap communication with backward computation?

A

Starts all-reducing finished gradients while earlier layers still computing.

  • Backward: L32→L31→…→L1 (sequential)
  • Once Layer 32’s gradient is done, it never changes
  • DDP groups params into buckets (~25 MB). Fires all-reduce per bucket when ready.
  • Communication and compute run in parallel
  • Only stall: waiting for last bucket’s all-reduce
  • What enables it: backprop is a chain — finalized gradients are independent
22
Q

Pipeline Parallelism: What exactly flows between GPUs, and in what pattern?

A

Activation tensors (forward) and activation gradients (backward), point-to-point between adjacent stages.

  • Each GPU holds different layers: GPU0=L1-8, GPU1=L9-16, etc.
  • Forward: GPU0 outputs activation [b,s,d], sends to GPU1
  • Backward: GPU3 sends ∂L/∂activation back to GPU2
  • Micro-batches flow like a conveyor belt: μ₁ at GPU0→1→2→3 over 4 time steps
  • While μ₁ is at GPU1, μ₂ enters GPU0 → stages active simultaneously
  • No all-reduce — only neighbor-to-neighbor sends
23
Q

Pipeline Parallelism: 1F1B vs GPipe — same bubble, different what?

A

Same bubble fraction, drastically different activation memory.

  • Both: bubble = (P-1)/(M+P-1)
  • GPipe: all M forwards then all M backwards. Stores M micro-batches of activations.
  • 1F1B: interleave F and B after fill. Stores at most P micro-batches.
  • M=32, P=4: GPipe=32× activations, 1F1B=4× → 8× less memory
  • 1F1B is standard (PipeDream, Megatron-LM)
24
Q

All-Reduce Cost: What is the total communication cost of ring all-reduce?

A

~2D bytes per GPU, regardless of GPU count.

  • D = gradient tensor size
  • Two phases: reduce-scatter (~D) + all-gather (~D) = ~2D
  • Precisely: 2 × D × (N-1)/N → approaches 2D as N grows
  • Per-GPU cost independent of N — adding GPUs doesn’t increase communication
  • This is why DP scales well: throughput grows linearly, communication constant
  • Naive approach (send to one node): O(N×D) bottleneck — terrible
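The precise cost formula, sketched to show it approaches 2D as the ring grows:

```python
def ring_all_reduce_bytes(D, n):
    """Per-GPU bytes moved by ring all-reduce on a D-byte tensor over n GPUs:
    reduce-scatter + all-gather phases, each D*(n-1)/n."""
    return 2 * D * (n - 1) / n

for n in (2, 8, 64):
    print(round(ring_all_reduce_bytes(1.0, n), 3))  # 1.0, 1.75, 1.969 → 2D
```

Per-GPU traffic saturates near 2D instead of growing with n, which is the whole case for ring over the naive gather-to-one-node scheme.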
25
*System Design:* When specifically is pipeline parallelism needed?
**When TP is maxed within a node and the model still doesn't fit on one node.**
- TP limited to intra-node: max ~8 GPUs (NVLink)
- If after TP=8 the model must spread across nodes → PP bridges nodes
- PP works over InfiniBand: only P-1 point-to-point sends (not 2L all-reduces)
- **Never used alone** — always with TP+DP in 3D parallelism
- Recipe: TP=8 within node, PP=4 across nodes, DP=K across groups
- For <10B models: usually no PP needed (FSDP±TP suffices)
- Essential at 30B+ with 100s of GPUs
26
*Memory Anatomy:* What are the 4 components that consume GPU memory during training, and which is fixed vs variable?
**Parameters, gradients, optimizer state (fixed), and activations (variable).**
- **Parameters** (θ): the model weights
- **Gradients** (∇θ): one per parameter
- **Optimizer state**: Adam stores momentum (m) and variance (v)
- **Activations**: intermediate outputs saved for backward
- Params + grads + optimizer = **fixed** (model size only)
- Activations = **variable** (batch, seq len, hidden dim)
- *"Model state is fixed at ~16N bytes; activations are variable and often dominate"*
27
*Memory Anatomy:* For N params with Adam mixed precision (bf16), give the memory breakdown.
**Total = 16N bytes.**
- Master weights (fp32): 4N
- bf16 weight copy: 2N
- Gradients (bf16): 2N
- Adam momentum m (fp32): 4N
- Adam variance v (fp32): 4N
- 1B params → 16 GB. A100 = 80 GB → max ~5B state before activations
- *Mnemonic*: "The model is cheap, Adam is expensive" (optimizer = 8N = half)
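The 16N breakdown can be tallied in a couple of lines (plain Python, decimal GB):

```python
# Bytes per parameter under Adam mixed precision (bf16), per the card.
components = {
    "master weights fp32": 4,
    "bf16 weight copy":    2,
    "gradients bf16":      2,
    "adam momentum fp32":  4,
    "adam variance fp32":  4,
}
per_param = sum(components.values())
print(per_param)                  # 16 bytes/param
print(per_param * 1e9 / 1e9)      # 16.0 GB of state for a 1B-param model
```

Half of the 16 bytes is Adam's m and v, which is what the mnemonic points at.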
28
*Communication:* What is all-reduce and when is it used?
**Every GPU contributes a tensor → every GPU gets the same sum.**
- START: GPU0=[A] GPU1=[B] → END: all=[A+B]
- **Used by DDP**: average gradients so all GPUs have identical result
- Ring algorithm: ~`2D` bytes/GPU, independent of GPU count
- All-Reduce = Reduce-Scatter + All-Gather
29
*Communication:* What is reduce-scatter and when is it used?
**Every GPU contributes a full tensor → each gets a different shard of the sum.**
- START: all have full tensor → END: each has a different reduced shard
- **Used by FSDP/ZeRO-2+**: gradient sync (each GPU only needs its shard)
- Cost: ~`D` bytes/GPU — **half** of all-reduce
- *Mnemonic*: "Reduce then Scatter — each gets a different reduced chunk"
30
*Communication:* What is all-gather and when is it used?
**Each GPU has a shard → every GPU gets the full concatenated data.**
- START: each has a shard → END: all have full data (no summing, just concat)
- **Used by FSDP**: reconstruct full layer weights before forward/backward
- Cost: ~`D` bytes/GPU
- *Mnemonic*: "All get the Gathered data"
31
*Communication:* Key interconnect bandwidths and their impact on parallelism?
**10-50× gap between intra-node and inter-node dictates placement.**
- **NVLink**: 600-900 GB/s (within node)
- **InfiniBand**: 25-50 GB/s (across nodes)
- **Ethernet**: 1-12.5 GB/s (budget)
- This gap → TP (comm-heavy) stays intra-node; PP/DP go inter-node
- 440 MB layer: NVLink ~0.7 ms, InfiniBand ~10 ms
32
*Backpropagation:* At each layer, what 2 computations and what does each need?
**Question A (weight gradient) and Question B (error propagation).**
- **A**: `∂L/∂W_i = δ_i · h_(i-1)^T` — needs **saved activation** → computes weight gradient for optimizer
- **B**: `δ_(i-1) = δ_i · W_i^T` — needs **weight matrix** → propagates error signal downstream
- Without A: can't update weights. Without B: chain breaks.
- This is why both activations AND weights must be in memory during backward.
33
*Communication:* Express: All-Reduce = Reduce-Scatter + All-Gather. Why does this matter?
**It's literally how ring all-reduce works. Explains FSDP communication.**
- Reduce-scatter: ~D bytes. All-gather: ~D bytes. All-reduce: ~2D total.
- DDP: all-reduce (~2D) — needs full result everywhere
- FSDP backward: reduce-scatter only (~D) — each GPU only needs its shard
- FSDP forward: all-gather only (~D) — reconstruct weights from shards
- FSDP total ~3D/layer vs DDP ~2D total = ~1.5× more comm
34
*Training Basics:* What are the two 'walls' motivating multi-GPU training?
**Memory wall (can't fit) and time wall (too slow).**
- **Memory wall**: model state + activations > GPU memory → OOM crash
- 7B model = 112 GB > A100's 80 GB
- **Time wall**: single GPU too slow → impractical wait
- Gradient accumulation: helps memory wall (activation part only)
- Multi-GPU: helps both walls
35
*Gradient Accumulation:* How does it work in PyTorch?
**Multiple forward+backward on micro-batches, accumulate grads, one optimizer step.**
- `loss = model(micro_batch) / accum_steps` ← scale!
- `loss.backward()` ← adds to param.grad
- Every K steps: `optimizer.step()` then `optimizer.zero_grad()`
- **Solves**: want batch 256, only 32 fits → 8 accum steps
- Does NOT speed up — same total compute
- Must divide loss by K to get mean not sum
36
*Gradient Accumulation:* Pros, cons, limits?
**Memory trick only. No speedup. Can't decrease batch below num_GPUs.**
- **Pro**: zero comm cost, mathematically identical to larger batch
- **Con**: no speedup, each step K× longer
- Only helps when **activations** are the bottleneck (not model state — 16N is fixed)
- Can only increase batch, never decrease — min = num_GPUs × 1
- *First thing to try* before multi-GPU
37
*PyTorch:* Three key objects and how they interact with parameters?
**Model, loss, optimizer — all act on param.grad.**
- `model`: holds param tensors with `.grad` attribute
- `loss.backward()`: **writes to** param.grad (ADDS to existing)
- `optimizer.step()`: reads param.grad, updates param
- `optimizer.zero_grad()`: resets param.grad to zero
- **Optimizer never sees loss** — constructed with model.parameters()
- Loss is ephemeral — new each forward pass
38
*PyTorch:* Why does loss.backward() add instead of overwrite?
**Enables gradient accumulation automatically.**
- Default: `param.grad += new_gradient`
- Multiple backward() calls sum on the same param.grad
- `zero_grad()` resets the accumulation
- Forgetting zero_grad = stale gradients from previous steps
- Each backward uses a different graph but the same param.grad tensor
39
*Gradient Accumulation:* Why divide loss by accumulation steps?
**To get the mean gradient, matching full-batch behavior.**
- Without: 4 backwards → grad = g₁+g₂+g₃+g₄ (4× too large)
- With loss/4: each contributes g_i/4 → sum = mean
- Forgetting = effective LR is K× too high → divergence
- *Most common bug* in gradient accumulation implementations
40
*DDP:* Core idea in 5 steps?
**Copy model everywhere, split data, all-reduce gradients.**
1. Replicate the full model on every GPU
2. Split the batch — one slice per GPU
3. Each GPU: forward + backward independently
4. All-reduce gradients → same averaged gradient everywhere
5. Each GPU: optimizer step independently → identical weights
- ~linear throughput (8 GPUs ≈ 7.5× speed)
- **No memory savings** — each GPU holds the full 16N
- Effective batch = B_local × K
41
*DDP:* Why is it mathematically exact?
**Splitting and averaging per-GPU gradients = full-batch gradient.**
- Full batch: ∇L = (1/B) Σ ∇ℓ(xᵢ)
- GPU k: g_k = (1/(B/K)) Σᵢ∈shard_k ∇ℓ(xᵢ)
- (1/K) Σ g_k = (1/B) Σ ∇ℓ(xᵢ) — identical!
- No approximation — why sync DDP is preferred over async
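The averaging identity can be verified numerically. A small numpy sketch with equal shard sizes, as the derivation assumes:

```python
import numpy as np

# Averaging per-shard mean gradients equals the full-batch mean gradient.
rng = np.random.default_rng(2)
per_example = rng.normal(size=(8, 3))        # ∇ℓ(x_i) for B=8 examples
full = per_example.mean(axis=0)              # (1/B) Σ ∇ℓ(x_i)

shards = np.split(per_example, 4)            # K=4 GPUs, B/K examples each
avg_of_means = np.mean([s.mean(axis=0) for s in shards], axis=0)
print(np.allclose(full, avg_of_means))  # True
```

With unequal shard sizes the plain average would be a weighted estimate instead, which is one reason DP launchers keep per-GPU batches equal.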
42
*DDP:* Sync vs async parallelism — what and why async is dead?
**Sync: all wait for all-reduce. Async: update without waiting → stale gradients.**
- **Sync** (used): all finish → all-reduce → all step. Same weights everywhere.
- **Async** (dead): push gradients without waiting. Some GPUs use stale weights.
- Stale gradient problem → noisy, unstable, worse convergence
- Died because ring all-reduce is fast enough now
- *"Async trades freshness for speed; modern all-reduce makes it unnecessary"*
43
*DDP:* How does PyTorch overlap communication with backward?
**Starts all-reducing finished layers while earlier layers are still computing.**
- Backward: last→first layer (sequential)
- Once a layer's gradient is done, it **never changes**
- DDP groups params into **buckets** (~25 MB), fires all-reduce per bucket when ready
- Comm and compute **overlap**
- Only stall: last bucket finishing
44
*DDP:* Scaling limits?
**Model must fit on 1 GPU, and batch size grows with GPU count.**
- **Hard**: 16N must fit per GPU. 7B = 112 GB > 80 GB → fails
- **Soft**: eff. batch = B×K. 256 GPUs × 32 = 8192 → convergence issues
- Batch fixes: LR warmup + linear scaling, LARS/LAMB
- *Upgrade*: can't fit → FSDP; need model split → TP/PP
45
*Large Batch:* Why does SGD noise help generalization?
**Noise bounces you out of sharp minima into flat ones — flat = better generalization.**
- Sharp minima: low train loss, steep walls → fragile at test time
- Flat minima: wide valley → robust to perturbations
- Small batch: noisy grads → can't stay in sharp minima → finds flat ones
- Large batch: exact grads → descends into sharp minima and stays
- Fewer updates/epoch = fewer course corrections
- *"SGD noise is a free regularizer"* (Keskar 2017)
46
*DDP:* Communication cost and why it scales well?
**~2D bytes/GPU via ring all-reduce, independent of GPU count.**
- D = gradient size (params × bytes)
- Ring: 2D×(N-1)/N → approaches 2D
- Per-GPU cost doesn't grow with N → near-linear scaling
- Overlaps with backward (bucket trick)
- Bottleneck: small models where compute < communication
47
*Tensor Parallelism:* Core idea and what problem it solves?
**Split weight matrices across GPUs — each holds a slice of each layer.**
- DDP: full model on each GPU → fails when too large
- TP: each GPU holds 1/T of each weight matrix
- Reduces **both** parameter AND activation memory
- Attention heads split naturally (64/8 = 8 heads/GPU)
- Mathematically exact
- *"DDP splits data. TP splits weights."*
48
*Tensor Parallelism:* Column-parallel vs row-parallel for Y = X·W?
**Column splits the output dim (no comm). Row splits the input dim (needs all-reduce).**
- **Column** W=[W₁|W₂]: same input X, each GPU gets a partial output (different columns). No comm.
- **Row** W=[W₁;W₂]: split input, each GPU gets a partial sum of the full output. Needs **all-reduce** to combine.
- Column→Row pairing minimizes communication in transformers
49
*Tensor Parallelism:* How does Megatron minimize FFN communication?
**Column-parallel W₁ → GeLU → row-parallel W₂ → one all-reduce. 1 comm for 2 linears.**
- W₁ column: full X in, each GPU gets a partial hidden
- GeLU: applied per GPU independently
- W₂ row: the partial hidden IS the right split for row input
- W₂ output = partial sums → one all-reduce → final result
- Without the trick: 2 all-reduces. The trick saves 50% comm.
50
*Tensor Parallelism:* How is multi-head attention parallelized?
**Q/K/V column-parallel → local attention → W_O row-parallel → all-reduce.**
- 64 heads, TP=8 → 8 heads/GPU
1. Q,K,V projections (column): each GPU for its heads. No comm.
2. Attention (local): independent per head. No comm.
3. W_O (row): partial sums
4. All-reduce → final result
- 1 all-reduce for the attention block. Total per layer: **2 all-reduces** (1 FFN + 1 attn)
51
*Tensor Parallelism:* Communication cost and why NVLink required?
**2 all-reduces/layer, ~O(b·s·d) each. Per-layer frequency demands NVLink.**
- b=4, s=2048, d=8192, bf16 → ~128 MB per all-reduce
- 2/layer × 40 layers = ~10 GB per forward pass
- NVLink (600 GB/s): ~17 ms — hidden behind compute
- InfiniBand (50 GB/s): ~200 ms — massive stalls
- *Rule*: TP within one node only. Max degree = GPUs/node (~8)
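The card's message-size estimates check out arithmetically:

```python
# Activation bytes per TP all-reduce for b=4, s=2048, d=8192 in bf16 (2 bytes),
# with 2 all-reduces per layer over 40 layers, as in the card.
b, s, d, bytes_per = 4, 2048, 8192, 2
per_allreduce = b * s * d * bytes_per
total = per_allreduce * 2 * 40
print(per_allreduce // 2**20)     # 128 (MiB per all-reduce)
print(round(total / 1e9, 1))      # 10.7 (GB per forward pass)
```

At ~10 GB per forward, the 600 GB/s vs 50 GB/s interconnect gap translates directly into the ~17 ms vs ~200 ms figures the card cites.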
52
*Tensor Parallelism:* Memory savings vs FSDP?
**TP reduces both weights AND activations. FSDP only reduces weights/optimizer.**
- TP: layer weights /T, activations /T, optimizer /T
- FSDP: shards model state, but activations stay full-size per GPU
- TP helps for long sequences even after FSDP handles state
- Efficiency drops at high TP (matrices too small for tensor cores)
- Typical: TP = 2, 4, or 8
53
*Tensor Parallelism:* Why are attention heads 'embarrassingly parallel'?
**Each head operates independently — no cross-head interaction until the output projection.**
- Head h: softmax(Q_h·K_h^T/√d_k)·V_h — only its own Q,K,V
- 64 heads on 8 GPUs = 8/GPU, full independence
- Only at W_O do results combine: row-parallel + all-reduce
- One comm event for all heads
- *This natural parallelism is why TP was designed for transformers*
54
*Pipeline Parallelism:* Core idea and how it differs from TP?
**Split the model by layers into stages. TP splits within layers, PP splits between layers.**
- GPU0: L1-8, GPU1: L9-16, etc. Each holds ~N/P params
- Comm: **point-to-point** activations between adjacent stages
- Works across nodes (InfiniBand — infrequent sends)
- *"TP cuts horizontally (weight matrices). PP cuts vertically (layer groups)."*
55
*Pipeline Parallelism:* Bubble problem and formula?
**Idle GPU time from sequential deps. Bubble = (P-1)/(M+P-1).**
- P = stages, M = micro-batches
- Naive M=1: 1/P utilization (terrible)
- P=4, M=16: 16% bubble
- P=4, M=32: 9% bubble
- Bubble = fill + drain triangles at pipeline edges
- *Rough rule*: ≈ 1/M when M >> P
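The bubble formula in a sketch, reproducing the card's numbers:

```python
def bubble_fraction(P, M):
    """Idle fraction of a P-stage pipeline fed M micro-batches."""
    return (P - 1) / (M + P - 1)

print(round(bubble_fraction(4, 16), 2))  # 0.16
print(round(bubble_fraction(4, 32), 2))  # 0.09
print(round(bubble_fraction(4, 1), 2))   # 0.75 — naive M=1: only 1/P busy
```

Doubling M roughly halves the bubble once M >> P, which is the "≈ 1/M" rule of thumb.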
56
*Pipeline Parallelism:* How do micro-batches flow through the pipeline?
**Like a conveyor belt — each travels through all stages, one time step per stage.**
- T=0: GPU0 processes μ₁. T=1: GPU0→μ₂, GPU1→μ₁. T=2: three GPUs active.
- Forward: activation [b,s,d] sent GPU_i → GPU_(i+1)
- Backward: gradient flows GPU_(i+1) → GPU_i
- Multiple stages active simultaneously = pipeline is "full"
- Only adjacent GPUs communicate (point-to-point)
57
*Pipeline Parallelism:* GPipe vs 1F1B — same bubble, different...?
**Different peak activation memory. 1F1B stores P, GPipe stores M.**
- GPipe: all M forwards then all M backwards → holds all M activations
- 1F1B: interleave after fill → backward frees activations early → holds only P
- M=32, P=4: GPipe 32×, 1F1B 4× → **8× less memory**
- Same bubble fraction for both
- 1F1B is standard (PipeDream, Megatron-LM)
58
*Pipeline vs Tensor Parallelism:* Communication comparison?
**PP: few P2P sends. TP: many all-reduces. PP needs far less bandwidth.**
- TP: 2× per layer → 80 all-reduces for 40 layers
- PP: P-1 per forward → 3 sends for 4 stages
- Both ~O(b·s·d) per message, but 80 vs 3 events!
- *This is why* TP = NVLink (intra-node), PP = InfiniBand (inter-node)
59
*Pipeline Parallelism:* Memory savings?
**Each GPU holds only 1/P of params and optimizer state.**
- Weights: N/P. Optimizer: 16N/P bytes.
- Activations: 1F1B O(P), GPipe O(M) micro-batches
- True model memory reduction (unlike DDP = full copy)
- With TP: each GPU holds 1/(T×P) of the total model
- Risk: load imbalance if layer sizes are uneven
60
*Pipeline Parallelism:* Main downsides?
**Bubble, complexity, load imbalance.**
1. Bubble: unavoidable (P-1)/(M+P-1)
2. Scheduling complexity: multiple micro-batches, interleaving F/B
3. Load imbalance: uneven layer sizes → stages waiting
4. Small micro-batches → less GPU utilization each
- Rarely used alone — always with TP + DP
- Mainly for cross-node model splitting
61
*ZeRO:* Core insight — what redundancy does DDP have?
**Each GPU has an identical full 16N copy. Only 16N unique → waste.**
- DDP: 16N × K GPUs. Only 16N unique bytes.
- ZeRO: each GPU owns 1/K of the state
- Stage 1: shard optimizer (8N/K each)
- Stage 2: + gradients
- Stage 3/FSDP: + params → 16N/K each
- Computation identical to DDP — only storage changes
62
*ZeRO:* Memory per GPU at each stage (K GPUs)?
**Each stage shards one more component.**
- DDP: 16N | ZeRO-1: 8N+8N/K | ZeRO-2: 6N+10N/K | FSDP: **16N/K**
- K=8: DDP 16N → ZeRO-1 9N → ZeRO-2 7.25N → FSDP **2N**
- 7B: DDP = 112 GB → FSDP w/8 GPUs = **14 GB** per GPU
- ZeRO-2 also reduces comm (reduce-scatter < all-reduce)
- ZeRO-3 increases comm ~1.5× vs DDP
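The per-stage table follows mechanically from which components get sharded. A sketch of that bookkeeping (bytes/param, Adam bf16 mixed precision as elsewhere in the deck):

```python
def zero_bytes_per_param(stage, K):
    """Per-GPU bytes/param under ZeRO. Unsharded pieces stay full size;
    sharded pieces divide by K. params = 4N master fp32 + 2N bf16 copy."""
    params, grads, optim = 6, 2, 8
    if stage == 0:                                   # plain DDP
        return params + grads + optim
    if stage == 1:                                   # shard optimizer
        return params + grads + optim / K
    if stage == 2:                                   # + shard gradients
        return params + grads / K + optim / K
    return (params + grads + optim) / K              # stage 3 / FSDP

print([zero_bytes_per_param(s, 8) for s in range(4)])  # [16, 9.0, 7.25, 2.0]
```

With K=8 this reproduces the card's 16N → 9N → 7.25N → 2N progression exactly.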
63
*FSDP:* Describe the 'dance' per layer in forward and backward.
**All-gather → compute → free. Per layer, twice (fwd + bwd).**
- *Forward*: shard only → all-gather full W → forward → FREE W → next layer
- *Backward*: all-gather full W again → backward → reduce-scatter grads → FREE W
- Total per layer: 2× all-gather + 1× reduce-scatter = 3 comm ops
- Full weights exist only **temporarily** during one layer's compute
64
*FSDP:* Why does layer-by-layer gather save memory?
**Only ONE layer's full weights at a time — not the whole model.**
- Peak = shards (N/K) + one full layer (N_layer). NOT the full model.
- 7B, 32 layers, K=8: 1.75 GB + 440 MB = **2.2 GB** (vs DDP 14 GB)
- Layers processed sequentially → never need all weights simultaneously
- Why gather twice: fwd L5 and bwd L5 are ~60 layers apart in time
- *Library analogy*: desk fits 1 book, drawer holds 1/4 of each
65
*FSDP:* What does it stand for? Relation to ZeRO?
**Fully Sharded Data Parallel = PyTorch-native ZeRO Stage 3.**
- Fully Sharded: all 3 components sharded (params, grads, optimizer)
- Data Parallel: same concept as DDP (different data per GPU)
- `torch.distributed.fsdp` — no external library
- DeepSpeed ZeRO: Microsoft's impl, more features (CPU/NVMe offload)
- FSDP for <10B PyTorch; DeepSpeed for 10B+ with offloading
66
*FSDP vs DDP:* Communication comparison?
**FSDP ~1.5× DDP volume, in many small events vs few large.**
- DDP: ~2N bytes, ~1 event (bucketed all-reduce)
- FSDP: ~3N bytes, 3L events (per layer)
- On NVLink: within 5-15% of DDP throughput (overlap)
- Across nodes: overhead increases
- *Note*: ZeRO-2 communicates LESS than DDP (reduce-scatter = half)
67
*ZeRO-2:* Why does reduce-scatter reduce communication vs DDP?
**Reduce-scatter is literally half of all-reduce.**
- All-reduce = reduce-scatter + all-gather = ~2D
- ZeRO-2: only reduce-scatter = ~D (50% less!)
- Each GPU only needs its gradient shard → full result unnecessary
- Small all-gather after the optimizer step for updated params
- Net: ZeRO-2 comm < DDP — a free improvement
68
*FSDP:* When does it fail, requiring TP or PP?
**Slow inter-node comm, batch explosion, single layer too big.**
1. Inter-node: per-layer all-gathers over InfiniBand can't hide behind compute
2. Batch: 512 GPUs × local_batch = huge eff. batch → convergence issues
3. Layer too large: temp gathered weights exceed GPU memory
- Rule: FSDP <10B; +TP 10-30B; full 3D 30B+
- Mirelo 1-3B: FSDP alone sufficient
69
*3D Parallelism:* What is it? Total GPU formula?
**DP × TP × PP on a 3D mesh. Total GPUs = DP × TP × PP.**

- TP: splits weight matrices (intra-layer)
- PP: splits layer stages (inter-layer)
- DP/FSDP: splits training data
- Example: 256 = TP=8 × PP=4 × DP=8
- FSDP can replace DDP on the DP axis
70
*3D Parallelism:* The golden rule for mapping to hardware?
**"TP inside, PP across, DP everywhere."**

- TP: very high bandwidth → within a node (NVLink 600+ GB/s)
- PP: moderate bandwidth → across nodes (InfiniBand 25-50 GB/s)
- DP/FSDP: low bandwidth → anywhere
- Violating this (e.g., TP across nodes) = massive slowdowns
- 1 node = 1 TP group; nodes chain into pipelines; pipelines form DP replicas
71
*3D Parallelism:* Concrete 70B config on 256 GPUs (32×8)?
**TP=8, PP=4, DP=8. Each GPU: 1/8 of the weights in 1/4 of the layers.**

- TP=8: all 8 GPUs in a node split each weight matrix (NVLink)
- PP=4: 4 nodes form one pipeline
- DP=8: 8 pipeline groups, each on different data
- Per GPU: 70B/(8×4) ≈ 2.2B params → ~4.4 GB of bf16 weights (full training state ≈ 35 GB). Plenty on an 80 GB A100.
- Flow: TP within the node, pipeline across nodes, DP sync overlapped
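The per-GPU numbers fall out of simple division (illustrative math, using the 16-bytes-per-parameter state rule from the mixed-precision cards):

```python
tp, pp, dp = 8, 4, 8
total_gpus = tp * pp * dp                  # 256 GPUs on a 32-node × 8-GPU cluster
params_per_gpu = 70e9 / (tp * pp)          # ≈ 2.19e9 params per GPU (DP replicates)
weight_gb = params_per_gpu * 2 / 1e9       # bf16 weights ≈ 4.4 GB
state_gb = params_per_gpu * 16 / 1e9       # full Adam mixed-precision state ≈ 35 GB
```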
72
*Scale Guidance:* Model size → parallelism strategy?
**Use the simplest strategy that works.**

- <1B: DDP
- 1-10B: FSDP alone
- 10-30B: TP + FSDP
- 30B+: TP + PP + FSDP (full 3D)
- Always start simple; add complexity only if needed
- Mirelo 1-3B: FSDP on 1-4 nodes
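As a sketch, the size thresholds above can be encoded as a lookup (the cutoffs are this card's rules of thumb, not hard limits — real choices also depend on sequence length, batch size, and interconnect):

```python
def pick_parallelism(params_billion):
    """Heuristic strategy choice by model size; thresholds are rules of thumb."""
    if params_billion < 1:
        return "DDP"
    if params_billion < 10:
        return "FSDP"
    if params_billion < 30:
        return "TP + FSDP"
    return "TP + PP + FSDP (full 3D)"

pick_parallelism(2)   # a Mirelo-scale 1-3B model lands on plain FSDP
```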
73
*3D Parallelism:* Where does FSDP fit?
**Replaces DDP on the data-parallel axis — same role, much less memory.**

- Classic 3D: DP replicas each hold the full model state
- FSDP: state sharded across the DP group (16N/K instead of 16N)
- <10B: FSDP alone handles both memory AND speed
- Larger: FSDP on the outer axis + TP intra-node + PP inter-node
74
*3D Parallelism:* Trace one training step through TP=8, PP=4, DP=8.
**Data split → pipeline with TP inside → DP gradient sync.**

1. Each of the 8 DP replicas gets a different batch
2. Forward: pipeline stages compute, with TP all-reduces within nodes
3. Activations flow stage→stage across nodes (1F1B schedule)
4. Backward: reverse direction, TP all-reduces per stage
5. DP sync: all-reduce grads across the 8 replicas (overlapped with backward)
6. Optimizer: each GPU updates its own shard
75
*Mixed Precision:* Core idea and why not everything in bf16?
**Compute in bf16, keep master weights in fp32. Optimizer updates are too small for bf16.**

- Tensor cores are ~2× faster in bf16
- But lr × gradient can be smaller than the gap between adjacent bf16 values near the weight
- Such updates round to zero when added → training stalls
- Solution: fp32 master weights for the optimizer step, cast to bf16 for compute
- Total state: still 16N (4N master + 2N bf16 weights + 2N grads + 8N Adam)
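The "updates round to zero" failure can be shown in pure Python by truncating fp32 bits to bfloat16 precision (a from-scratch rounding helper; Python's fp64 floats stand in for the fp32 master copy):

```python
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision (8 exponent bits, 7 mantissa bits)
    by keeping the top 16 bits of its float32 form, round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

w, update = 1.0, 1e-4                     # weight and a typical tiny lr*grad step
stalled = to_bf16(to_bf16(w) + update)    # bf16 weight: the update vanishes
master = w + update                       # fp32 master weight: the update survives
```

The bf16 spacing near 1.0 is 2⁻⁷ ≈ 0.0078, so a 1e-4 update rounds straight back to 1.0 — exactly why the optimizer keeps an fp32 master copy.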
76
*Mixed Precision:* fp16 vs bf16 — bits, range, precision?
**bf16 = fp32's range in fp16's size. bf16 is the industry standard.**

- fp16: 5 exponent bits, 10 mantissa bits, max ~65,504, needs loss scaling
- bf16: **8 exponent bits**, 7 mantissa bits, max ~3.4×10³⁸, no loss scaling
- bf16's 8 exponent bits = same range as fp32 → no overflow
- bf16 is less precise than fp16 (7 vs 10 mantissa bits) — doesn't matter for training
- *"bf16 = fp32's range in fp16's size"*
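The max values follow directly from the bit layouts (max biased exponent × max mantissa):

```python
fp16_max = (2 - 2**-10) * 2**15    # 5 exp bits (bias 15), 10 mantissa → 65504.0
bf16_max = (2 - 2**-7) * 2**127    # 8 exp bits (bias 127), 7 mantissa → ~3.39e38
fp16_eps = 2**-10                  # relative precision ~1e-3
bf16_eps = 2**-7                   # relative precision ~8e-3 (coarser, but fine)
```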
77
*Mixed Precision:* Dynamic loss scaling — what and when needed?
**Scale the loss up to prevent fp16 gradient underflow. Only needed for fp16, not bf16.**

1. scaled_loss = loss × scale (e.g., 1024)
2. backward → grads are 1024× larger (above the underflow threshold)
3. Before the optimizer step: divide grads by the scale
4. If inf appears: skip the step, reduce the scale
5. If stable for a while: increase the scale
- Dynamic: the scale adapts during training
- bf16 doesn't need it (range = fp32)
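A minimal sketch of that loop (illustrative only; in real PyTorch this is `torch.cuda.amp.GradScaler` — the class name, method, and growth interval below are made up for the sketch):

```python
import math

class ToyLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a stable run."""
    def __init__(self, scale=1024.0, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._stable = 0

    def unscale_or_skip(self, grads):
        """grads came from backward() on (loss * scale). Returns unscaled grads,
        or None if the optimizer step must be skipped because of inf/NaN."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale /= 2.0            # overflow: shrink scale, skip this step
            self._stable = 0
            return None
        unscaled = [g / self.scale for g in grads]
        self._stable += 1
        if self._stable >= self.growth_interval:
            self.scale *= 2.0            # long stable stretch: try a larger scale
            self._stable = 0
        return unscaled
```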
78
*Debugging:* fp16 training produces NaN. Cause?
**Gradient overflow — a value exceeds fp16's max of 65,504.**

- Gradient > 65,504 → `inf` → `inf - inf` → NaN → propagates everywhere
- **Overflow** = NaN. **Underflow** (→0) = silently stalled training, not NaN.
- Fix: (1) dynamic loss scaling (2) switch to bf16
- *"NaN in fp16 → gradient overflow → loss scaling or bf16"*
79
*Activation Checkpointing:* What is it and what trade-off?
**Don't save all activations — recompute them during backward. Trades compute for memory.**

- Standard: save every layer's activations → O(L) memory
- Checkpoint every √L layers, recompute the layers in between
- **Memory: O(L) → O(√L)**
- **Compute overhead: ~33%** (one extra forward pass)
- Essential for large models + long sequences (audio!)
80
*Activation Checkpointing:* Why √L and why 33% overhead?
**√L balances checkpoint storage against recompute. The extra forward ≈ 1/3 of total compute.**

- √L checkpoints → √L segments of √L layers each
- Recompute: √L × √L = L layers = 1 extra full forward pass
- Forward ≈ 1/3 of total compute (backward ≈ 2× forward) → ~33% overhead
- Alternatives: checkpoint every layer O(1)/100% overhead; no checkpointing O(L)/0%
- √L = the sweet spot
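The √L bookkeeping in a few lines (toy counting; "backward ≈ 2× forward" is the usual rough cost model):

```python
import math

def checkpointing_stats(n_layers):
    """Peak activation slots and compute overhead under sqrt(L) checkpointing."""
    seg = round(math.sqrt(n_layers))
    peak_activations = seg + seg           # sqrt(L) checkpoints + one live segment
    recomputed = n_layers                  # sqrt(L) segments × sqrt(L) layers each
    overhead = recomputed / (3 * n_layers) # fwd=1, bwd=2 → baseline is 3 fwd-units
    return peak_activations, overhead

peak, overhead = checkpointing_stats(64)   # 16 activation slots vs 64, ~33% extra
```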
81
*Activation Checkpointing:* How big are activations? When essential?
**Activations can exceed the model state. Essential for 1B+ models and long sequences.**

- Per layer: ≈ b × s × d × ~10 intermediate tensors × 2 bytes (bf16)
- 1B model (24 layers, d=2048), b=8, s=4096, bf16: ~1.3 GB/layer → **~31 GB** (2× the model state!)
- With √24 ≈ 5 checkpoints: ~6.5 GB → ~5× reduction
- Essential for: 1B+ models, long audio, large batches
- Attention is O(s²) in activation memory without FlashAttention
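The estimate spelled out (decimal GB; the ~10 intermediate tensors per layer is the card's rough factor, so treat the results as ballpark):

```python
def activation_gb(batch, seq, d_model, n_layers, tensors_per_layer=10, bytes_per=2):
    """Rough transformer activation memory without FlashAttention-style savings."""
    per_layer = batch * seq * d_model * tensors_per_layer * bytes_per
    return n_layers * per_layer / 1e9

full = activation_gb(8, 4096, 2048, 24)   # all 24 layers saved: ≈ 32 GB
ckpt = activation_gb(8, 4096, 2048, 5)    # only √24 ≈ 5 checkpoints: ≈ 6.7 GB
```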
82
*System Design:* 3 baseline techniques before any parallelism?
**bf16, activation checkpointing, gradient accumulation. Always.**

1. bf16: 2× speed, half the activation memory, one flag
2. Activation checkpointing: O(√L) memory, ~33% extra compute
3. Gradient accumulation: effective-batch control without more memory
- "Before parallelism: bf16, checkpointing, gradient accumulation."
- Omitting these = jumping to complexity without trying the simple things first
83
*System Design:* Structured approach to choosing parallelism?
**Memory math → baselines → simplest parallelism → add complexity.**

1. 16N bytes of state. Does it fit on 80 GB?
2. Enable bf16, checkpointing, gradient accumulation
3. Try FSDP alone (16N/K). Fits now?
4. Not enough → add TP within nodes
5. Still not enough → add PP across nodes
6. Verify batch size, comm overhead, utilization
- **Commit to a concrete answer** with numbers; don't just explore options
84
*Interview:* How to structure 'compare X, Y, Z' answers?
**Name the axes first, then fill them in. Use a mental table.**

- "They differ on 3 axes: what's split, communication, when to use."
- DP: splits data, 1 all-reduce/step, low bandwidth, model must fit
- TP: splits weights, 2 all-reduces/layer, high bandwidth, intra-node
- PP: splits layers, P-1 point-to-point sends, moderate bandwidth, cross-node
- Close with: "Combine them: TP inside, PP across, DP everywhere."
- *Structured answers score higher with the same content*
85
*Scaling:* 4 bottleneck categories when GPUs don't speed things up?
**Communication, pipeline bubbles, GPU underutilization, data I/O.**

1. Communication: all-reduce time > compute time. Slow interconnect or small model.
2. Bubble: (P-1)/(M+P-1) idle fraction. More stages → worse.
3. GPU underutilization: matrices too small (high TP degree, small batch).
4. Data I/O: GPUs starved for input. Fix: more dataloader workers, NVMe.
- **Amdahl's Law**: 10% serial → max 10× speedup, even with infinite GPUs
- Enumerate all 4, then diagnose
86
*Quick Math:* 3B + Adam + mixed precision. Memory? A100?
**48 GB of state → fits an A100 (80 GB) with 32 GB left over.**

- 3B × 16 bytes = 48 GB model state
- 80 GB − 48 GB = 32 GB for activations
- With checkpointing: fine for a moderate batch
- Long audio: might be tight → FSDP/TP
- Contrast: 7B × 16 = 112 GB → doesn't fit on one GPU
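The 16-bytes-per-parameter rule, as one reusable line:

```python
def adam_mixed_precision_state_gb(params_billion):
    # 4 (fp32 master) + 2 (bf16 weights) + 2 (bf16 grads) + 8 (Adam moments) = 16 B/param
    return params_billion * 16

assert adam_mixed_precision_state_gb(3) == 48    # fits: 80 - 48 = 32 GB to spare
assert adam_mixed_precision_state_gb(7) == 112   # doesn't fit one 80 GB A100
```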
87
*Mirelo:* Likely training setup for 1-3B audio model?
**FSDP across 8-32 GPUs, bf16, activation checkpointing.**

- 1.5B: 24 GB state → fits one A100, but audio = huge activations
- FSDP with K=8: 3 GB state per GPU → 77 GB free
- bf16: standard. Checkpointing: essential for 30s audio
- Gradient accumulation: batch control
- TP/PP: probably unnecessary at this scale
- Simple, PyTorch-native, sufficient
88
*Amdahl's Law:* Statement + diminishing returns example?
**Speedup = 1/(s + (1−s)/N). The serial fraction caps the maximum speedup.**

- 5% serial: 8 GPUs → ~6×, 64 GPUs → 15.4×, 256 GPUs → 18.6×, ∞ → 20×
- Efficiency drops: ~75% → 24% → 7%
- Serial parts: data loading, optimizer step, logging, pipeline bubbles, non-overlapped comm
- Pure DP has diminishing returns — serial overhead dominates at scale
- *Mentioning Amdahl shows fundamental understanding*
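The numbers above reproduce directly from the formula:

```python
def amdahl_speedup(serial_frac, n):
    """Amdahl's Law: speedup on n workers given a serial fraction."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

for n in (8, 64, 256):
    s = amdahl_speedup(0.05, n)
    print(f"{n:4d} GPUs: {s:5.1f}x speedup, {s / n:5.1%} efficiency")
# the n → ∞ limit with 5% serial is 1/0.05 = 20x
```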
89
*Summary Table:* All parallelism strategies at a glance?
**Each strategy splits a different dimension.**

| Strategy | Splits | Memory savings | Communication | Bandwidth need | When to use |
|---|---|---|---|---|---|
| **DDP** | data | none | 1 all-reduce/step | low | model fits on one GPU |
| **FSDP** | data + state | 16N/K | 3 ops/layer | medium | state too large |
| **TP** | weight matrices | /T (incl. activations) | 2 all-reduces/layer | very high | intra-node |
| **PP** | layers | N/P | P-1 point-to-point | medium | cross-node, huge model |

Plus the baselines: bf16 (2× speed), checkpointing (√L memory), grad accumulation (batch control).