Mire_dis_train Flashcards

(89 cards)

1
Q

Backpropagation: During backward pass, a layer needs the weight matrix W to compute what, and why?

A

W is needed to propagate the error signal to the layer below.

  • Formula: δ_i = δ_(i+1) · W_(i+1)^T
  • This is “Question B” of backprop — passing the gradient downstream
  • Without W, the chain rule breaks: layer i-1 would have no error signal
  • The saved activation h_(i-1) is needed for a different purpose: computing ∂L/∂W_i = δ_i · h_(i-1)^T (“Question A”)
  • Interview insight: “Backward needs weights for gradient propagation, activations for weight updates”
2
Q

Backpropagation: Write the full chain rule for ∂L/∂W₁ in a 3-layer network h₁=W₁x, h₂=W₂h₁, h₃=W₃h₂, L=loss(h₃).

A

∂L/∂W₁ = (upstream gradient δ₁) · x^T, where the upstream gradient unfolds via chain rule:

  • ∂L/∂W₁ = (∂L/∂h₃) · (∂h₃/∂h₂) · (∂h₂/∂h₁) · (∂h₁/∂W₁)
  • Substituting (for squared-error loss, so ∂L/∂h₃ = h₃ − y): (h₃ − y) · W₃^T · W₂^T · x^T
  • Each “pass-through” factor is W_k^T — this is why vanishing/exploding gradients occur
  • Backprop shortcut: define δ_i = δ_(i+1) · W_(i+1)^T recursively, then ∂L/∂W_i = δ_i · h_(i-1)^T
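The δ recursion above can be sanity-checked numerically. A minimal numpy sketch, assuming a squared-error loss (so ∂L/∂h₃ = h₃ − y) and column-vector convention (the error propagates as W^T · δ):

```python
import numpy as np

# Tiny check of the delta recursion for h1=W1x, h2=W2h1, h3=W3h2,
# L = 0.5*||h3 - y||^2 (squared-error assumed, as in the card).
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
W1, W2, W3 = (rng.normal(size=(3, 3)) for _ in range(3))

h1 = W1 @ x
h2 = W2 @ h1
h3 = W3 @ h2

delta3 = h3 - y                # dL/dh3
delta2 = W3.T @ delta3         # dL/dh2: pass error through W3^T
delta1 = W2.T @ delta2         # dL/dh1: pass error through W2^T
grad_W1 = np.outer(delta1, x)  # dL/dW1 = delta1 · x^T (Question A)

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
L0 = 0.5 * np.sum((W3 @ (W2 @ (W1 @ x)) - y) ** 2)
Lp = 0.5 * np.sum((W3 @ (W2 @ (W1p @ x)) - y) ** 2)
print(abs((Lp - L0) / eps - grad_W1[0, 1]) < 1e-3)  # True
```

The finite-difference probe confirms that propagating δ through the transposed weights gives the same ∂L/∂W₁ as perturbing the weight directly.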
3
Q

Communication Primitives: Do all-reduce, all-gather, and reduce-scatter use a central server? How are they actually implemented?

A

No central server. Modern collective ops use peer-to-peer communication, most commonly via the ring algorithm (with tree variants at large scale).

  • GPUs in a logical ring: each sends right, receives left
  • Ring all-reduce transfers ~2D bytes per GPU, regardless of GPU count
  • Parameter server architecture (~2014) is dead — creates bandwidth bottleneck
  • Key fact: All-reduce = reduce-scatter + all-gather (two phases of ring algorithm)
4
Q

Communication Primitives: What does each GPU start with and end with for all-reduce, reduce-scatter, and all-gather?

A

Decode each name — the words tell you the operation:

  • All-Reduce: full tensor → same full sum on all GPUs
  • Reduce-Scatter: full tensor → different shard of sum per GPU
  • All-Gather: shard → same full concatenation on all GPUs

Remembering:
- All-Reduce: same result everywhere, reduced (summed)
- Reduce-Scatter: scattered pieces, reduced (summed)
- All-Gather: same result everywhere, just gathered (no sum)

5
Q

Communication Primitives: Which parallelism strategy uses reduce-scatter, and why is it the right choice?

A

FSDP/ZeRO-2+ uses reduce-scatter for gradient sync after backward.

  • Each GPU only owns a shard of params/grads/optimizer
  • After backward, each GPU needs only its shard of summed gradient
  • Reduce-scatter delivers exactly that — no wasted memory
  • DDP’s all-reduce gives full gradient to everyone → wastes (K-1)/K memory
  • Reduce-scatter transfers ~D bytes (half of all-reduce’s ~2D) → less communication than DDP
6
Q

PyTorch: Where do gradients physically live? Does the optimizer know about the loss?

A

Gradients live on parameter tensors as param.grad, not on the loss or optimizer.

  • model.layer1.weight.grad — same shape as weight
  • loss.backward() writes to param.grad (ADDS to existing)
  • optimizer.step() reads param.grad, updates param
  • optimizer.zero_grad() resets param.grad to zero
  • Optimizer never sees the loss. Constructed with Adam(model.parameters())
  • Loss is ephemeral — new tensor each forward pass
7
Q

PyTorch: Why does gradient accumulation work without special code beyond delaying zero_grad()?

A

Because loss.backward() ADDS to param.grad by default, not overwrite.

  • After zero_grad(): W.grad = 0
  • After loss1.backward(): W.grad = g₁
  • After loss2.backward(): W.grad = g₁ + g₂ (accumulated!)
  • Must divide loss by accumulation_steps to get mean not sum
  • Each backward uses different computation graph but same param.grad tensor
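The accumulation identity can be demonstrated without torch. A small numpy sketch, simulating how adding per-micro-batch gradients of (loss / K) reproduces the full-batch mean gradient (equal micro-batch sizes assumed):

```python
import numpy as np

# Model: scalar w, loss = mean((w*x - y)^2). backward() ADDS into param.grad,
# so summing micro-batch gradients of (loss / K) equals the full-batch gradient.
rng = np.random.default_rng(1)
x, y = rng.normal(size=8), rng.normal(size=8)
w = 0.5

def grad(xs, ys):
    # d/dw of mean((w*x - y)^2)
    return np.mean(2 * (w * xs - ys) * xs)

full = grad(x, y)                  # one big batch of 8

K = 4
acc = 0.0                          # param.grad right after zero_grad()
for xs, ys in zip(np.split(x, K), np.split(y, K)):
    acc += grad(xs, ys) / K        # loss / K, then backward() accumulates
print(np.isclose(acc, full))  # True
```

Dropping the `/ K` would leave `acc` K× too large, which is exactly the "effective LR too high" bug the deck warns about.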
8
Q

Large Batch Training: Why do very large batch sizes hurt generalization, even though the gradient is more accurate?

A

Three reasons:

  1. Loss of implicit regularization: SGD noise bounces you out of sharp minima into flat ones (better generalization). Large batches → less noise → settle in sharp minima (Keskar et al., 2017)
  2. Fewer updates per epoch: batch=32 gives N/32 course corrections; batch=8192 gives N/8192
  3. Linear scaling rule breaks: double batch → double LR works up to ~8K-32K, then overshooting
  • Mnemonic: “SGD noise is a free regularizer — rejects sharp minima”
  • Fixes: LR warmup, LARS/LAMB optimizers, limit max batch size
9
Q

FSDP: How does FSDP save memory if every GPU needs full weights for forward/backward?

A

Layer-by-layer gather — never holds full model, only one layer at a time.

  • Permanent: N/K per GPU (only shards)
  • Before Layer 5: all-gather full weights → +N_layer temporary
  • After Layer 5: FREE gathered weights immediately
  • Peak: N/K + N_layer — NOT full model

7B model, 32 layers, K=8:
- DDP: 14 GB weight memory
- FSDP: 1.75 GB permanent + ~440 MB temp = ~2.2 GB
- Key: layers processed sequentially, never need all weights simultaneously
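The peak-memory arithmetic above is easy to check. A sketch using the card's own numbers (7B params, bf16 = 2 bytes/param, 32 equal layers, K=8):

```python
# Peak weight memory: DDP holds the full copy; FSDP holds its shard
# plus one temporarily gathered layer.
N_bytes = 7e9 * 2                             # full bf16 weights: 14 GB
K, layers = 8, 32
ddp_peak = N_bytes                            # full replica per GPU
fsdp_peak = N_bytes / K + N_bytes / layers    # shards + one gathered layer
print(round(ddp_peak / 1e9, 2))   # 14.0
print(round(fsdp_peak / 1e9, 2))  # 2.19
```

The ~440 MB temp term is one layer's weights (14 GB / 32); the permanent 1.75 GB is the shard (14 GB / 8).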

10
Q

FSDP: Why must FSDP all-gather weights TWICE — once in forward, once in backward for the same layer?

A

Forward and backward for the same layer are separated in time.

  • Forward: L1→L2→…→L32→loss
  • Backward: L32→L31→…→L1
  • Layer 5 forward is early, backward is late
  • Keeping weights from forward through backward = holding them the entire step → defeats FSDP
  • Free after forward, re-gather for backward → peak stays N/K + N_layer instead of full model
11
Q

FSDP Communication: How does FSDP’s communication compare to DDP and ZeRO-1/2?

A

ZeRO-2 < DDP < ZeRO-3/FSDP in total volume.

  • DDP: 1 all-reduce = ~2N bytes/GPU
  • ZeRO-2: reduce-scatter = ~N bytes (half of DDP!)
  • ZeRO-3/FSDP: 2× all-gather + 1× reduce-scatter per layer = ~3N bytes (~1.5× DDP)
  • FSDP overhead mostly hidden behind compute on NVLink
  • Common mistake: saying FSDP is “tiny” extra comm — it’s 1.5×, not negligible
12
Q

Terminology: What is the difference between ‘saved activation’ and ‘activation function’?

A

Completely different — don’t mix them up in interviews.

  • Activation function: nonlinearity like ReLU, GeLU — a math operation
  • Saved activation: the intermediate tensor (h₁, h₂) stored during forward for backward use
  • “Activations consume memory” = saved tensors [b, s, d], not the function
  • Activation checkpointing = which tensors to keep vs recompute
  • Wrong: “free the activation function” — Right: “free the saved activation”
13
Q

Mixed Precision: fp16 training produces NaN. Overflow or underflow? How to fix?

A

Overflow — gradients exceeding fp16 max of 65,504.

  • Gradient > 65504 → inf in fp16
  • inf - inf or 0 × inf → NaN → propagates everywhere
  • Underflow causes stalled training (loss plateaus), NOT NaN
  • Fixes: (1) dynamic loss scaling, (2) switch to bf16
  • One-liner: “NaN in fp16 → gradient overflow → loss scaling or bf16”
14
Q

Mixed Precision: Compare fp16 and bf16 bit layouts. Why doesn’t bf16 need loss scaling?

A

bf16 has fp32’s range but fp16’s size — immune to overflow.

  • fp16: 5 exponent, 10 mantissa, max ~65,504
  • bf16: 8 exponent, 7 mantissa, max ~3.4×10³⁸ (same as fp32)
  • bf16’s 8 exponent bits = same range as fp32 → no overflow → no loss scaling
  • bf16 has less precision than fp16 (7 vs 10 mantissa) — doesn’t matter for training
  • Mnemonic: “bf16 = fp32’s range in fp16’s size”
15
Q

System Design: Before proposing parallelism, what 3 baseline techniques should you always mention?

A

Mixed precision, activation checkpointing, gradient accumulation.

  1. bf16: ~2× faster, halves activation memory, trivial to enable
  2. Activation checkpointing: O(L) → O(√L) activation memory, ~33% extra compute
  3. Gradient accumulation: larger effective batch without more memory
  • Say these FIRST: “Before parallelism, I’d enable bf16, activation checkpointing, and gradient accumulation.”
  • Interview insight: omitting baselines signals you jump to complex solutions first
16
Q

Memory Wall: Both FSDP and TP solve ‘model doesn’t fit.’ How do they differ?

A

FSDP shards model state. TP splits computation.

  • FSDP: stores 1/K of everything, gathers per layer temporarily. Each GPU still does full layer computation.
  • TP: each GPU holds a slice of weight matrices, does partial matrix multiply. Reduces BOTH param AND activation memory.
  • FSDP fails when: single layer too big for temp gather, OR activations from long sequences too large
  • TP helps with both — but needs NVLink, adds 2 all-reduces/layer
  • Rule: FSDP for <10B, add TP for >10B or long sequences
17
Q

Scaling: Name the 4 bottleneck categories when adding GPUs doesn’t speed things up.

A

Communication, bubbles, GPU underutilization, data I/O.

  1. Communication: all-reduce/gather slower than compute. Slow interconnect or small model.
  2. Pipeline bubble: (P-1)/(M+P-1) wasted. More stages → worse. Fix: more micro-batches.
  3. GPU underutilization: per-GPU matrices too small for tensor cores (TP too high, batch too small).
  4. Data I/O: GPUs starved waiting for data. Fix: more workers, prefetch, NVMe.
  • Amdahl’s Law: 10% serial → max 10× speedup, infinite GPUs
  • Approach: enumerate categories first, then discuss each
18
Q

Scaling: State Amdahl’s Law and its implication for distributed training.

A

Speedup = 1 / (s + (1-s)/N) where s = serial fraction, N = processors.

  • 10% serial → max speedup = 10×, even with infinite GPUs
  • Serial portions: optimizer step, logging, data loading, pipeline bubbles, non-overlapped communication
  • 256 GPUs + 2% serial → speedup ≈ 42× (only ~16% efficiency); 50× is the cap even as N→∞
  • This is why pure DP has diminishing returns — serial overhead dominates
  • Mentioning Amdahl’s Law shows you understand fundamental limits
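Plugging the card's numbers into the formula, as a quick sketch:

```python
def amdahl(s, n):
    """Max speedup with serial fraction s on n processors: 1/(s + (1-s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

print(round(amdahl(0.02, 256), 1))   # 42.0 — 2% serial on 256 GPUs
print(round(amdahl(0.02, 10**9), 1)) # 50.0 — asymptotic cap is 1/s
print(round(amdahl(0.10, 10**9), 1)) # 10.0 — the card's 10%-serial example
```

Note the efficiency: 42×/256 GPUs ≈ 16%, which is why serial overhead dominates at scale.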
19
Q

Interview Technique: How should you structure a ‘compare X, Y, Z’ answer?

A

Open with axes, then fill in each briefly. Table format.

  • Bad: meander through concepts with long paragraphs
  • Good: “They differ on 3 axes: what’s split, communication, when to use”
  • Use a mental table: rows = strategies, columns = comparison axes
  • Example axes for DP/TP/PP: what’s split, communication cost, bandwidth need, when to use
  • Finish with: “In practice we combine them: TP inside, PP across, DP everywhere”
  • Structured answers score higher with identical content
20
Q

Gradient Accumulation: Can it reduce effective batch size? What’s the minimum?

A

No — accumulation only increases batch, never decreases.

  • Effective batch = local_batch × GPUs × accum_steps
  • Minimum: accum=1, local_batch=1 → effective = num_GPUs
  • 32 GPUs → min effective batch = 32
  • If optimal batch < 32, reduce DP degree (use some GPUs for TP instead)
  • Example: TP=4, DP=8 → min effective batch = 8
  • Rarely a problem: most models train well with batch 64-512
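The effective-batch formula above in a two-line sketch:

```python
def effective_batch(local_batch, dp_degree, accum_steps):
    # Effective batch = local batch × data-parallel degree × accumulation steps
    return local_batch * dp_degree * accum_steps

print(effective_batch(1, 32, 1))  # 32 — the floor with 32 data-parallel GPUs
print(effective_batch(1, 8, 1))   # 8  — 32 GPUs with TP=4 leaves DP=8
```

The only knob that lowers the floor is the DP degree, which is why the card suggests spending GPUs on TP instead.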
21
Q

DDP: How does PyTorch DDP overlap communication with backward computation?

A

Starts all-reducing finished gradients while earlier layers still computing.

  • Backward: L32→L31→…→L1 (sequential)
  • Once Layer 32’s gradient is done, it never changes
  • DDP groups params into buckets (~25 MB). Fires all-reduce per bucket when ready.
  • Communication and compute run in parallel
  • Only stall: waiting for last bucket’s all-reduce
  • What enables it: backprop is a chain — finalized gradients are independent
22
Q

Pipeline Parallelism: What exactly flows between GPUs, and in what pattern?

A

Activation tensors (forward) and activation gradients (backward), point-to-point between adjacent stages.

  • Each GPU holds different layers: GPU0=L1-8, GPU1=L9-16, etc.
  • Forward: GPU0 outputs activation [b,s,d], sends to GPU1
  • Backward: GPU3 sends ∂L/∂activation back to GPU2
  • Micro-batches flow like a conveyor belt: μ₁ at GPU0→1→2→3 over 4 time steps
  • While μ₁ is at GPU1, μ₂ enters GPU0 → stages active simultaneously
  • No all-reduce — only neighbor-to-neighbor sends
23
Q

Pipeline Parallelism: 1F1B vs GPipe — same bubble, different what?

A

Same bubble fraction, drastically different activation memory.

  • Both: bubble = (P-1)/(M+P-1)
  • GPipe: all M forwards then all M backwards. Stores M micro-batches of activations.
  • 1F1B: interleave F and B after fill. Stores at most P micro-batches.
  • M=32, P=4: GPipe=32× activations, 1F1B=4× → 8× less memory
  • 1F1B is standard (PipeDream, Megatron-LM)
24
Q

All-Reduce Cost: What is the total communication cost of ring all-reduce?

A

~2D bytes per GPU, regardless of GPU count.

  • D = gradient tensor size
  • Two phases: reduce-scatter (~D) + all-gather (~D) = ~2D
  • Precisely: 2 × D × (N-1)/N → approaches 2D as N grows
  • Per-GPU cost independent of N — adding GPUs doesn’t increase communication
  • This is why DP scales well: throughput grows linearly, communication constant
  • Naive approach (send to one node): O(N×D) bottleneck — terrible
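The precise cost formula, sketched to show it approaches 2D as the ring grows:

```python
def ring_all_reduce_bytes(D, n):
    """Per-GPU bytes moved by ring all-reduce on a D-byte tensor over n GPUs:
    reduce-scatter + all-gather phases, each D*(n-1)/n."""
    return 2 * D * (n - 1) / n

for n in (2, 8, 64):
    print(round(ring_all_reduce_bytes(1.0, n), 3))  # 1.0, 1.75, 1.969 → 2D
```

Per-GPU traffic saturates near 2D instead of growing with n, which is the whole case for ring over the naive gather-to-one-node scheme.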
25
*System Design:* When specifically is pipeline parallelism needed?
**When TP is maxed within a node and the model still doesn't fit on one node.**
- TP limited to intra-node: max ~8 GPUs (NVLink)
- If after TP=8 the model must spread across nodes → PP bridges nodes
- PP works over InfiniBand: only P-1 point-to-point sends (not 2L all-reduces)
- **Never used alone** — always with TP+DP in 3D parallelism
- Recipe: TP=8 within node, PP=4 across nodes, DP=K across groups
- For <10B models: usually no PP needed (FSDP±TP suffices)
- Essential at 30B+ with 100s of GPUs
26
*Memory Anatomy:* What are the 4 components that consume GPU memory during training, and which is fixed vs variable?
**Parameters, gradients, optimizer state (fixed), and activations (variable).**
- **Parameters** (θ): the model weights
- **Gradients** (∇θ): one per parameter
- **Optimizer state**: Adam stores momentum (m) and variance (v)
- **Activations**: intermediate outputs saved for backward
- Params + grads + optimizer = **fixed** (model size only)
- Activations = **variable** (batch, seq len, hidden dim)
- *"Model state is fixed at ~16N bytes; activations are variable and often dominate"*
27
*Memory Anatomy:* For N params with Adam mixed precision (bf16), give the memory breakdown.
**Total = 16N bytes.**
- Master weights (fp32): 4N
- bf16 weight copy: 2N
- Gradients (bf16): 2N
- Adam momentum m (fp32): 4N
- Adam variance v (fp32): 4N
- 1B params → 16 GB. A100 = 80 GB → max ~5B state before activations
- *Mnemonic*: "The model is cheap, Adam is expensive" (optimizer = 8N = half)
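The 16N breakdown can be tallied in a couple of lines (plain Python, decimal GB):

```python
# Bytes per parameter under Adam mixed precision (bf16), per the card.
components = {
    "master weights fp32": 4,
    "bf16 weight copy":    2,
    "gradients bf16":      2,
    "adam momentum fp32":  4,
    "adam variance fp32":  4,
}
per_param = sum(components.values())
print(per_param)                  # 16 bytes/param
print(per_param * 1e9 / 1e9)      # 16.0 GB of state for a 1B-param model
```

Half of the 16 bytes is Adam's m and v, which is what the mnemonic points at.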
28
*Communication:* What is all-reduce and when is it used?
**Every GPU contributes a tensor → every GPU gets the same sum.**
- START: GPU0=[A] GPU1=[B] → END: all=[A+B]
- **Used by DDP**: average gradients so all GPUs have identical result
- Ring algorithm: ~`2D` bytes/GPU, independent of GPU count
- All-Reduce = Reduce-Scatter + All-Gather
29
*Communication:* What is reduce-scatter and when is it used?
**Every GPU contributes a full tensor → each gets a different shard of the sum.**
- START: all have full tensor → END: each has a different reduced shard
- **Used by FSDP/ZeRO-2+**: gradient sync (each GPU only needs its shard)
- Cost: ~`D` bytes/GPU — **half** of all-reduce
- *Mnemonic*: "Reduce then Scatter — each gets a different reduced chunk"
30
*Communication:* What is all-gather and when is it used?
**Each GPU has a shard → every GPU gets the full concatenated data.**
- START: each has a shard → END: all have full data (no summing, just concat)
- **Used by FSDP**: reconstruct full layer weights before forward/backward
- Cost: ~`D` bytes/GPU
- *Mnemonic*: "All get the Gathered data"
31
*Communication:* Key interconnect bandwidths and their impact on parallelism?
**10-50× gap between intra-node and inter-node dictates placement.**
- **NVLink**: 600-900 GB/s (within node)
- **InfiniBand**: 25-50 GB/s (across nodes)
- **Ethernet**: 1-12.5 GB/s (budget)
- This gap → TP (comm-heavy) stays intra-node; PP/DP go inter-node
- 440 MB layer: NVLink ~0.7 ms, InfiniBand ~10 ms
32
*Backpropagation:* At each layer, what 2 computations and what does each need?
**Question A (weight gradient) and Question B (error propagation).**
- **A**: `∂L/∂W_i = δ_i · h_(i-1)^T` — needs **saved activation** → computes weight gradient for optimizer
- **B**: `δ_(i-1) = δ_i · W_i^T` — needs **weight matrix** → propagates error signal downstream
- Without A: can't update weights. Without B: chain breaks.
- This is why both activations AND weights must be in memory during backward.
33
*Communication:* Express: All-Reduce = Reduce-Scatter + All-Gather. Why does this matter?
**It's literally how ring all-reduce works. Explains FSDP communication.**
- Reduce-scatter: ~D bytes. All-gather: ~D bytes. All-reduce: ~2D total.
- DDP: all-reduce (~2D) — needs full result everywhere
- FSDP backward: reduce-scatter only (~D) — each GPU only needs its shard
- FSDP forward: all-gather only (~D) — reconstruct weights from shards
- FSDP total ~3D/layer vs DDP ~2D total = ~1.5× more comm
34
*Training Basics:* What are the two 'walls' motivating multi-GPU training?
**Memory wall (can't fit) and time wall (too slow).**
- **Memory wall**: model state + activations > GPU memory → OOM crash
- 7B model = 112 GB > A100's 80 GB
- **Time wall**: single GPU too slow → impractical wait
- Gradient accumulation: helps memory wall (activation part only)
- Multi-GPU: helps both walls
35
*Gradient Accumulation:* How does it work in PyTorch?
**Multiple forward+backward on micro-batches, accumulate grads, one optimizer step.**
- `loss = model(micro_batch) / accum_steps` ← scale!
- `loss.backward()` ← adds to param.grad
- Every K steps: `optimizer.step()` then `optimizer.zero_grad()`
- **Solves**: want batch 256, only 32 fits → 8 accum steps
- Does NOT speed up — same total compute
- Must divide loss by K to get mean not sum
36
*Gradient Accumulation:* Pros, cons, limits?
**Memory trick only. No speedup. Can't decrease batch below num_GPUs.**
- **Pro**: zero comm cost, mathematically identical to larger batch
- **Con**: no speedup, each step K× longer
- Only helps when **activations** are the bottleneck (not model state — 16N is fixed)
- Can only increase batch, never decrease — min = num_GPUs × 1
- *First thing to try* before multi-GPU
37
*PyTorch:* Three key objects and how they interact with parameters?
**Model, loss, optimizer — all act on param.grad.**
- `model`: holds param tensors with `.grad` attribute
- `loss.backward()`: **writes to** param.grad (ADDS to existing)
- `optimizer.step()`: reads param.grad, updates param
- `optimizer.zero_grad()`: resets param.grad to zero
- **Optimizer never sees loss** — constructed with model.parameters()
- Loss is ephemeral — new each forward pass
38
*PyTorch:* Why does loss.backward() add instead of overwrite?
**Enables gradient accumulation automatically.**
- Default: `param.grad += new_gradient`
- Multiple backward() calls sum on the same param.grad
- `zero_grad()` resets the accumulation
- Forgetting zero_grad = stale gradients from previous steps
- Each backward uses a different graph but the same param.grad tensor
39
*Gradient Accumulation:* Why divide loss by accumulation steps?
**To get the mean gradient, matching full-batch behavior.**
- Without: 4 backwards → grad = g₁+g₂+g₃+g₄ (4× too large)
- With loss/4: each contributes g_i/4 → sum = mean
- Forgetting = effective LR is K× too high → divergence
- *Most common bug* in gradient accumulation implementations
40
*DDP:* Core idea in 5 steps?
**Copy model everywhere, split data, all-reduce gradients.**
1. Replicate the full model on every GPU
2. Split the batch — one slice per GPU
3. Each GPU: forward + backward independently
4. All-reduce gradients → same averaged gradient everywhere
5. Each GPU: optimizer step independently → identical weights
- ~linear throughput (8 GPUs ≈ 7.5× speed)
- **No memory savings** — each GPU holds the full 16N
- Effective batch = B_local × K
41
*DDP:* Why is it mathematically exact?
**Splitting and averaging per-GPU gradients = full-batch gradient.**
- Full batch: ∇L = (1/B) Σ ∇ℓ(xᵢ)
- GPU k: g_k = (1/(B/K)) Σᵢ∈shard_k ∇ℓ(xᵢ)
- (1/K) Σ g_k = (1/B) Σ ∇ℓ(xᵢ) — identical!
- No approximation — why sync DDP is preferred over async
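The averaging identity can be verified numerically. A small numpy sketch with equal shard sizes, as the derivation assumes:

```python
import numpy as np

# Averaging per-shard mean gradients equals the full-batch mean gradient.
rng = np.random.default_rng(2)
per_example = rng.normal(size=(8, 3))        # ∇ℓ(x_i) for B=8 examples
full = per_example.mean(axis=0)              # (1/B) Σ ∇ℓ(x_i)

shards = np.split(per_example, 4)            # K=4 GPUs, B/K examples each
avg_of_means = np.mean([s.mean(axis=0) for s in shards], axis=0)
print(np.allclose(full, avg_of_means))  # True
```

With unequal shard sizes the plain average would be a weighted estimate instead, which is one reason DP launchers keep per-GPU batches equal.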
42
*DDP:* Sync vs async parallelism — what and why async is dead?
**Sync: all wait for all-reduce. Async: update without waiting → stale gradients.**
- **Sync** (used): all finish → all-reduce → all step. Same weights everywhere.
- **Async** (dead): push gradients without waiting. Some GPUs use stale weights.
- Stale gradient problem → noisy, unstable, worse convergence
- Died because ring all-reduce is fast enough now
- *"Async trades freshness for speed; modern all-reduce makes it unnecessary"*
43
*DDP:* How does PyTorch overlap communication with backward?
**Starts all-reducing finished layers while earlier layers are still computing.**
- Backward: last→first layer (sequential)
- Once a layer's gradient is done, it **never changes**
- DDP groups params into **buckets** (~25 MB), fires all-reduce per bucket when ready
- Comm and compute **overlap**
- Only stall: last bucket finishing
44
*DDP:* Scaling limits?
**Model must fit on 1 GPU, and batch size grows with GPU count.**
- **Hard**: 16N must fit per GPU. 7B = 112 GB > 80 GB → fails
- **Soft**: eff. batch = B×K. 256 GPUs × 32 = 8192 → convergence issues
- Batch fixes: LR warmup + linear scaling, LARS/LAMB
- *Upgrade*: can't fit → FSDP; need model split → TP/PP
45
*Large Batch:* Why does SGD noise help generalization?
**Noise bounces you out of sharp minima into flat ones — flat = better generalization.**
- Sharp minima: low train loss, steep walls → fragile at test time
- Flat minima: wide valley → robust to perturbations
- Small batch: noisy grads → can't stay in sharp minima → finds flat ones
- Large batch: exact grads → descends into sharp minima and stays
- Fewer updates/epoch = fewer course corrections
- *"SGD noise is a free regularizer"* (Keskar 2017)
46
*DDP:* Communication cost and why it scales well?
**~2D bytes/GPU via ring all-reduce, independent of GPU count.**
- D = gradient size (params × bytes)
- Ring: 2D×(N-1)/N → approaches 2D
- Per-GPU cost doesn't grow with N → near-linear scaling
- Overlaps with backward (bucket trick)
- Bottleneck: small models where compute < communication
47
*Tensor Parallelism:* Core idea and what problem it solves?
**Split weight matrices across GPUs — each holds a slice of each layer.**
- DDP: full model on each GPU → fails when too large
- TP: each GPU holds 1/T of each weight matrix
- Reduces **both** parameter AND activation memory
- Attention heads split naturally (64/8 = 8 heads/GPU)
- Mathematically exact
- *"DDP splits data. TP splits weights."*
48
*Tensor Parallelism:* Column-parallel vs row-parallel for Y = X·W?
**Column splits the output dim (no comm). Row splits the input dim (needs all-reduce).**
- **Column** W=[W₁|W₂]: same input X, each GPU gets a partial output (different columns). No comm.
- **Row** W=[W₁;W₂]: split input, each GPU gets a partial sum of the full output. Needs **all-reduce** to combine.
- Column→Row pairing minimizes communication in transformers
49
*Tensor Parallelism:* How does Megatron minimize FFN communication?
**Column-parallel W₁ → GeLU → row-parallel W₂ → one all-reduce. 1 comm for 2 linears.**
- W₁ column: full X in, each GPU gets a partial hidden
- GeLU: applied per GPU independently
- W₂ row: the partial hidden IS the right split for row input
- W₂ output = partial sums → one all-reduce → final result
- Without the trick: 2 all-reduces. The trick saves 50% comm.
50
*Tensor Parallelism:* How is multi-head attention parallelized?
**Q/K/V column-parallel → local attention → W_O row-parallel → all-reduce.**
- 64 heads, TP=8 → 8 heads/GPU
1. Q,K,V projections (column): each GPU for its heads. No comm.
2. Attention (local): independent per head. No comm.
3. W_O (row): partial sums
4. All-reduce → final result
- 1 all-reduce for the attention block. Total per layer: **2 all-reduces** (1 FFN + 1 attn)
51
*Tensor Parallelism:* Communication cost and why NVLink required?
**2 all-reduces/layer, ~O(b·s·d) each. Per-layer frequency demands NVLink.**
- b=4, s=2048, d=8192, bf16 → ~128 MB per all-reduce
- 2/layer × 40 layers = ~10 GB per forward pass
- NVLink (600 GB/s): ~17 ms — hidden behind compute
- InfiniBand (50 GB/s): ~200 ms — massive stalls
- *Rule*: TP within one node only. Max degree = GPUs/node (~8)
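The card's message-size estimates check out arithmetically:

```python
# Activation bytes per TP all-reduce for b=4, s=2048, d=8192 in bf16 (2 bytes),
# with 2 all-reduces per layer over 40 layers, as in the card.
b, s, d, bytes_per = 4, 2048, 8192, 2
per_allreduce = b * s * d * bytes_per
total = per_allreduce * 2 * 40
print(per_allreduce // 2**20)     # 128 (MiB per all-reduce)
print(round(total / 1e9, 1))      # 10.7 (GB per forward pass)
```

At ~10 GB per forward, the 600 GB/s vs 50 GB/s interconnect gap translates directly into the ~17 ms vs ~200 ms figures the card cites.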
52
*Tensor Parallelism:* Memory savings vs FSDP?
**TP reduces both weights AND activations. FSDP only reduces weights/optimizer.**
- TP: layer weights /T, activations /T, optimizer /T
- FSDP: shards model state, but activations stay full-size per GPU
- TP helps for long sequences even after FSDP handles state
- Efficiency drops at high TP (matrices too small for tensor cores)
- Typical: TP = 2, 4, or 8
53
*Tensor Parallelism:* Why are attention heads 'embarrassingly parallel'?
**Each head operates independently — no cross-head interaction until the output projection.**
- Head h: softmax(Q_h·K_h^T/√d_k)·V_h — only its own Q,K,V
- 64 heads on 8 GPUs = 8/GPU, full independence
- Only at W_O do results combine: row-parallel + all-reduce
- One comm event for all heads
- *This natural parallelism is why TP was designed for transformers*
54
*Pipeline Parallelism:* Core idea and how it differs from TP?
**Split the model by layers into stages. TP splits within layers, PP splits between layers.**
- GPU0: L1-8, GPU1: L9-16, etc. Each holds ~N/P params
- Comm: **point-to-point** activations between adjacent stages
- Works across nodes (InfiniBand — infrequent sends)
- *"TP cuts horizontally (weight matrices). PP cuts vertically (layer groups)."*
55
*Pipeline Parallelism:* Bubble problem and formula?
**Idle GPU time from sequential deps. Bubble = (P-1)/(M+P-1).**
- P = stages, M = micro-batches
- Naive M=1: 1/P utilization (terrible)
- P=4, M=16: 16% bubble
- P=4, M=32: 9% bubble
- Bubble = fill + drain triangles at pipeline edges
- *Rough rule*: ≈ 1/M when M >> P
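The bubble formula in a sketch, reproducing the card's numbers:

```python
def bubble_fraction(P, M):
    """Idle fraction of a P-stage pipeline fed M micro-batches."""
    return (P - 1) / (M + P - 1)

print(round(bubble_fraction(4, 16), 2))  # 0.16
print(round(bubble_fraction(4, 32), 2))  # 0.09
print(round(bubble_fraction(4, 1), 2))   # 0.75 — naive M=1: only 1/P busy
```

Doubling M roughly halves the bubble once M >> P, which is the "≈ 1/M" rule of thumb.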
56
*Pipeline Parallelism:* How do micro-batches flow through the pipeline?
**Like a conveyor belt — each travels through all stages, one time step per stage.**
- T=0: GPU0 processes μ₁. T=1: GPU0→μ₂, GPU1→μ₁. T=2: three GPUs active.
- Forward: activation [b,s,d] sent GPU_i → GPU_(i+1)
- Backward: gradient flows GPU_(i+1) → GPU_i
- Multiple stages active simultaneously = pipeline is "full"
- Only adjacent GPUs communicate (point-to-point)
57
*Pipeline Parallelism:* GPipe vs 1F1B — same bubble, different...?
**Different peak activation memory. 1F1B stores P, GPipe stores M.**
- GPipe: all M forwards then all M backwards → holds all M activations
- 1F1B: interleave after fill → backward frees activations early → holds only P
- M=32, P=4: GPipe 32×, 1F1B 4× → **8× less memory**
- Same bubble fraction for both
- 1F1B is standard (PipeDream, Megatron-LM)
58
*Pipeline vs Tensor Parallelism:* Communication comparison?
**PP: few P2P sends. TP: many all-reduces. PP needs far less bandwidth.**
- TP: 2× per layer → 80 all-reduces for 40 layers
- PP: P-1 per forward → 3 sends for 4 stages
- Both ~O(b·s·d) per message, but 80 vs 3 events!
- *This is why* TP = NVLink (intra-node), PP = InfiniBand (inter-node)
59
*Pipeline Parallelism:* Memory savings?
**Each GPU holds only 1/P of params and optimizer state.**
- Weights: N/P. Optimizer: 16N/P bytes.
- Activations: 1F1B O(P), GPipe O(M) micro-batches
- True model memory reduction (unlike DDP = full copy)
- With TP: each GPU holds 1/(T×P) of the total model
- Risk: load imbalance if layer sizes are uneven
60
*Pipeline Parallelism:* Main downsides?
**Bubble, complexity, load imbalance.**
1. Bubble: unavoidable (P-1)/(M+P-1)
2. Scheduling complexity: multiple micro-batches, interleaving F/B
3. Load imbalance: uneven layer sizes → stages waiting
4. Small micro-batches → less GPU utilization each
- Rarely used alone — always with TP + DP
- Mainly for cross-node model splitting
61
*ZeRO:* Core insight — what redundancy does DDP have?
**Each GPU has an identical full 16N copy. Only 16N unique → waste.**
- DDP: 16N × K GPUs. Only 16N unique bytes.
- ZeRO: each GPU owns 1/K of the state
- Stage 1: shard optimizer (8N/K each)
- Stage 2: + gradients
- Stage 3/FSDP: + params → 16N/K each
- Computation identical to DDP — only storage changes
62
*ZeRO:* Memory per GPU at each stage (K GPUs)?
**Each stage shards one more component.**
- DDP: 16N | ZeRO-1: 8N+8N/K | ZeRO-2: 6N+10N/K | FSDP: **16N/K**
- K=8: DDP 16N → ZeRO-1 9N → ZeRO-2 7.25N → FSDP **2N**
- 7B: DDP = 112 GB → FSDP w/8 GPUs = **14 GB** per GPU
- ZeRO-2 also reduces comm (reduce-scatter < all-reduce)
- ZeRO-3 increases comm ~1.5× vs DDP
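The per-stage table follows mechanically from which components get sharded. A sketch of that bookkeeping (bytes/param, Adam bf16 mixed precision as elsewhere in the deck):

```python
def zero_bytes_per_param(stage, K):
    """Per-GPU bytes/param under ZeRO. Unsharded pieces stay full size;
    sharded pieces divide by K. params = 4N master fp32 + 2N bf16 copy."""
    params, grads, optim = 6, 2, 8
    if stage == 0:                                   # plain DDP
        return params + grads + optim
    if stage == 1:                                   # shard optimizer
        return params + grads + optim / K
    if stage == 2:                                   # + shard gradients
        return params + grads / K + optim / K
    return (params + grads + optim) / K              # stage 3 / FSDP

print([zero_bytes_per_param(s, 8) for s in range(4)])  # [16, 9.0, 7.25, 2.0]
```

With K=8 this reproduces the card's 16N → 9N → 7.25N → 2N progression exactly.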
63
*FSDP:* Describe the 'dance' per layer in forward and backward.
**All-gather → compute → free. Per layer, twice (fwd + bwd).**
- *Forward*: shard only → all-gather full W → forward → FREE W → next layer
- *Backward*: all-gather full W again → backward → reduce-scatter grads → FREE W
- Total per layer: 2× all-gather + 1× reduce-scatter = 3 comm ops
- Full weights exist only **temporarily** during one layer's compute
64
*FSDP:* Why does layer-by-layer gather save memory?
**Only ONE layer's full weights at a time — not the whole model.**
- Peak = shards (N/K) + one full layer (N_layer). NOT the full model.
- 7B, 32 layers, K=8: 1.75 GB + 440 MB = **2.2 GB** (vs DDP 14 GB)
- Layers processed sequentially → never need all weights simultaneously
- Why gather twice: fwd L5 and bwd L5 are ~60 layers apart in time
- *Library analogy*: desk fits 1 book, drawer holds 1/4 of each
65
*FSDP:* What does it stand for? Relation to ZeRO?
**Fully Sharded Data Parallel = PyTorch-native ZeRO Stage 3.**
- Fully Sharded: all 3 components sharded (params, grads, optimizer)
- Data Parallel: same concept as DDP (different data per GPU)
- `torch.distributed.fsdp` — no external library
- DeepSpeed ZeRO: Microsoft's impl, more features (CPU/NVMe offload)
- FSDP for <10B PyTorch; DeepSpeed for 10B+ with offloading
66
*FSDP vs DDP:* Communication comparison?
**FSDP ~1.5× DDP volume, in many small events vs few large.**
- DDP: ~2N bytes, ~1 event (bucketed all-reduce)
- FSDP: ~3N bytes, 3L events (per layer)
- On NVLink: within 5-15% of DDP throughput (overlap)
- Across nodes: overhead increases
- *Note*: ZeRO-2 communicates LESS than DDP (reduce-scatter = half)
67
*ZeRO-2:* Why does reduce-scatter reduce communication vs DDP?
**Reduce-scatter is literally half of all-reduce.**
- All-reduce = reduce-scatter + all-gather = ~2D
- ZeRO-2: only reduce-scatter = ~D (50% less!)
- Each GPU only needs its gradient shard → full result unnecessary
- Small all-gather after the optimizer step for updated params
- Net: ZeRO-2 comm < DDP — a free improvement
68
*FSDP:* When does it fail, requiring TP or PP?
**Slow inter-node comm, batch explosion, single layer too big.**
1. Inter-node: per-layer all-gathers over InfiniBand can't hide behind compute
2. Batch: 512 GPUs × local_batch = huge eff. batch → convergence issues
3. Layer too large: temp gathered weights exceed GPU memory
- Rule: FSDP <10B; +TP 10-30B; full 3D 30B+
- Mirelo 1-3B: FSDP alone sufficient
69
*3D Parallelism:* What is it? Total GPU formula?
**DP × TP × PP on a 3D mesh. Total GPUs = DP × TP × PP.**

- TP: splits weight matrices (intra-layer)
- PP: splits layer stages (inter-layer)
- DP/FSDP: splits training data
- Example: 256 = TP=8 × PP=4 × DP=8
- FSDP can replace DDP on the DP axis
70
*3D Parallelism:* The golden rule for mapping to hardware?
**"TP inside, PP across, DP everywhere."**

- TP: very high bandwidth → within a node (NVLink 600+ GB/s)
- PP: moderate bandwidth → across nodes (InfiniBand 25-50 GB/s)
- DP/FSDP: low bandwidth → anywhere
- Violating this (e.g., TP across nodes) = massive slowdowns
- 1 node = 1 TP group; nodes chain into pipelines; pipelines form DP replicas
71
*3D Parallelism:* Concrete 70B config on 256 GPUs (32×8)?
**TP=8, PP=4, DP=8. Each GPU: 1/8 of the weights in 1/4 of the layers.**

- TP=8: all 8 GPUs in a node split each weight matrix (NVLink)
- PP=4: 4 nodes form one pipeline
- DP=8: 8 pipeline groups, each on different data
- Per GPU: 70B/(8×4) ≈ 2.2B params → ~4.4 GB of bf16 weights (full training state ≈ 35 GB). Plenty on an 80 GB A100.
- Flow: TP within the node, pipeline across nodes, DP sync overlapped
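The per-GPU numbers fall out of simple division (illustrative math, using the 16-bytes-per-parameter state rule from the mixed-precision cards):

```python
tp, pp, dp = 8, 4, 8
total_gpus = tp * pp * dp                  # 256 GPUs on a 32-node × 8-GPU cluster
params_per_gpu = 70e9 / (tp * pp)          # ≈ 2.19e9 params per GPU (DP replicates)
weight_gb = params_per_gpu * 2 / 1e9       # bf16 weights ≈ 4.4 GB
state_gb = params_per_gpu * 16 / 1e9       # full Adam mixed-precision state ≈ 35 GB
```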
72
*Scale Guidance:* Model size → parallelism strategy?
**Use the simplest strategy that works.**

- <1B: DDP
- 1-10B: FSDP alone
- 10-30B: TP + FSDP
- 30B+: TP + PP + FSDP (full 3D)
- Always start simple; add complexity only if needed
- Mirelo 1-3B: FSDP on 1-4 nodes
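As a sketch, the size thresholds above can be encoded as a lookup (the cutoffs are this card's rules of thumb, not hard limits — real choices also depend on sequence length, batch size, and interconnect):

```python
def pick_parallelism(params_billion):
    """Heuristic strategy choice by model size; thresholds are rules of thumb."""
    if params_billion < 1:
        return "DDP"
    if params_billion < 10:
        return "FSDP"
    if params_billion < 30:
        return "TP + FSDP"
    return "TP + PP + FSDP (full 3D)"

pick_parallelism(2)   # a Mirelo-scale 1-3B model lands on plain FSDP
```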
73
*3D Parallelism:* Where does FSDP fit?
**Replaces DDP on the data-parallel axis — same role, much less memory.**

- Classic 3D: DP replicas each hold the full model state
- FSDP: state sharded across the DP group (16N/K instead of 16N)
- <10B: FSDP alone handles both memory AND speed
- Larger: FSDP on the outer axis + TP intra-node + PP inter-node
74
*3D Parallelism:* Trace one training step through TP=8, PP=4, DP=8.
**Data split → pipeline with TP inside → DP gradient sync.**

1. Each of the 8 DP replicas gets a different batch
2. Forward: pipeline stages compute, with TP all-reduces within nodes
3. Activations flow stage→stage across nodes (1F1B schedule)
4. Backward: reverse direction, TP all-reduces per stage
5. DP sync: all-reduce grads across the 8 replicas (overlapped with backward)
6. Optimizer: each GPU updates its own shard
75
*Mixed Precision:* Core idea and why not everything in bf16?
**Compute in bf16, keep master weights in fp32. Optimizer updates are too small for bf16.**

- Tensor cores are ~2× faster in bf16
- But lr × gradient can be smaller than the gap between adjacent bf16 values near the weight
- Such updates round to zero when added → training stalls
- Solution: fp32 master weights for the optimizer step, cast to bf16 for compute
- Total state: still 16N (4N master + 2N bf16 weights + 2N grads + 8N Adam)
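The "updates round to zero" failure can be shown in pure Python by truncating fp32 bits to bfloat16 precision (a from-scratch rounding helper; Python's fp64 floats stand in for the fp32 master copy):

```python
import struct

def to_bf16(x):
    """Round a float to bfloat16 precision (8 exponent bits, 7 mantissa bits)
    by keeping the top 16 bits of its float32 form, round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

w, update = 1.0, 1e-4                     # weight and a typical tiny lr*grad step
stalled = to_bf16(to_bf16(w) + update)    # bf16 weight: the update vanishes
master = w + update                       # fp32 master weight: the update survives
```

The bf16 spacing near 1.0 is 2⁻⁷ ≈ 0.0078, so a 1e-4 update rounds straight back to 1.0 — exactly why the optimizer keeps an fp32 master copy.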
76
*Mixed Precision:* fp16 vs bf16 — bits, range, precision?
**bf16 = fp32's range in fp16's size. bf16 is the industry standard.**

- fp16: 5 exponent bits, 10 mantissa bits, max ~65,504, needs loss scaling
- bf16: **8 exponent bits**, 7 mantissa bits, max ~3.4×10³⁸, no loss scaling
- bf16's 8 exponent bits = same range as fp32 → no overflow
- bf16 is less precise than fp16 (7 vs 10 mantissa bits) — doesn't matter for training
- *"bf16 = fp32's range in fp16's size"*
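The max values follow directly from the bit layouts (max biased exponent × max mantissa):

```python
fp16_max = (2 - 2**-10) * 2**15    # 5 exp bits (bias 15), 10 mantissa → 65504.0
bf16_max = (2 - 2**-7) * 2**127    # 8 exp bits (bias 127), 7 mantissa → ~3.39e38
fp16_eps = 2**-10                  # relative precision ~1e-3
bf16_eps = 2**-7                   # relative precision ~8e-3 (coarser, but fine)
```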
77
*Mixed Precision:* Dynamic loss scaling — what and when needed?
**Scale the loss up to prevent fp16 gradient underflow. Only needed for fp16, not bf16.**

1. scaled_loss = loss × scale (e.g., 1024)
2. backward → grads are 1024× larger (above the underflow threshold)
3. Before the optimizer step: divide grads by the scale
4. If inf appears: skip the step, reduce the scale
5. If stable for a while: increase the scale
- Dynamic: the scale adapts during training
- bf16 doesn't need it (range = fp32)
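A minimal sketch of that loop (illustrative only; in real PyTorch this is `torch.cuda.amp.GradScaler` — the class name, method, and growth interval below are made up for the sketch):

```python
import math

class ToyLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a stable run."""
    def __init__(self, scale=1024.0, growth_interval=2000):
        self.scale = scale
        self.growth_interval = growth_interval
        self._stable = 0

    def unscale_or_skip(self, grads):
        """grads came from backward() on (loss * scale). Returns unscaled grads,
        or None if the optimizer step must be skipped because of inf/NaN."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale /= 2.0            # overflow: shrink scale, skip this step
            self._stable = 0
            return None
        unscaled = [g / self.scale for g in grads]
        self._stable += 1
        if self._stable >= self.growth_interval:
            self.scale *= 2.0            # long stable stretch: try a larger scale
            self._stable = 0
        return unscaled
```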
78
*Debugging:* fp16 training produces NaN. Cause?
**Gradient overflow — a value exceeds fp16's max of 65,504.**

- Gradient > 65,504 → `inf` → `inf - inf` → NaN → propagates everywhere
- **Overflow** = NaN. **Underflow** (→0) = silently stalled training, not NaN.
- Fix: (1) dynamic loss scaling (2) switch to bf16
- *"NaN in fp16 → gradient overflow → loss scaling or bf16"*
79
*Activation Checkpointing:* What is it and what trade-off?
**Don't save all activations — recompute them during backward. Trades compute for memory.**

- Standard: save every layer's activations → O(L) memory
- Checkpoint every √L layers, recompute the layers in between
- **Memory: O(L) → O(√L)**
- **Compute overhead: ~33%** (one extra forward pass)
- Essential for large models + long sequences (audio!)
80
*Activation Checkpointing:* Why √L and why 33% overhead?
**√L balances checkpoint storage against recompute. The extra forward ≈ 1/3 of total compute.**

- √L checkpoints → √L segments of √L layers each
- Recompute: √L × √L = L layers = 1 extra full forward pass
- Forward ≈ 1/3 of total compute (backward ≈ 2× forward) → ~33% overhead
- Alternatives: checkpoint every layer O(1)/100% overhead; no checkpointing O(L)/0%
- √L = the sweet spot
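The √L bookkeeping in a few lines (toy counting; "backward ≈ 2× forward" is the usual rough cost model):

```python
import math

def checkpointing_stats(n_layers):
    """Peak activation slots and compute overhead under sqrt(L) checkpointing."""
    seg = round(math.sqrt(n_layers))
    peak_activations = seg + seg           # sqrt(L) checkpoints + one live segment
    recomputed = n_layers                  # sqrt(L) segments × sqrt(L) layers each
    overhead = recomputed / (3 * n_layers) # fwd=1, bwd=2 → baseline is 3 fwd-units
    return peak_activations, overhead

peak, overhead = checkpointing_stats(64)   # 16 activation slots vs 64, ~33% extra
```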
81
*Activation Checkpointing:* How big are activations? When essential?
**Activations can exceed the model state. Essential for 1B+ models and long sequences.**

- Per layer: ≈ b × s × d × ~10 intermediate tensors × 2 bytes (bf16)
- 1B model (24 layers, d=2048), b=8, s=4096, bf16: ~1.3 GB/layer → **~31 GB** (2× the model state!)
- With √24 ≈ 5 checkpoints: ~6.5 GB → ~5× reduction
- Essential for: 1B+ models, long audio, large batches
- Attention is O(s²) in activation memory without FlashAttention
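The estimate spelled out (decimal GB; the ~10 intermediate tensors per layer is the card's rough factor, so treat the results as ballpark):

```python
def activation_gb(batch, seq, d_model, n_layers, tensors_per_layer=10, bytes_per=2):
    """Rough transformer activation memory without FlashAttention-style savings."""
    per_layer = batch * seq * d_model * tensors_per_layer * bytes_per
    return n_layers * per_layer / 1e9

full = activation_gb(8, 4096, 2048, 24)   # all 24 layers saved: ≈ 32 GB
ckpt = activation_gb(8, 4096, 2048, 5)    # only √24 ≈ 5 checkpoints: ≈ 6.7 GB
```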
82
*System Design:* 3 baseline techniques before any parallelism?
**bf16, activation checkpointing, gradient accumulation. Always.**

1. bf16: 2× speed, half the activation memory, one flag
2. Activation checkpointing: O(√L) memory, ~33% extra compute
3. Gradient accumulation: effective-batch control without more memory
- "Before parallelism: bf16, checkpointing, gradient accumulation."
- Omitting these = jumping to complexity without trying the simple things first
83
*System Design:* Structured approach to choosing parallelism?
**Memory math → baselines → simplest parallelism → add complexity.**

1. 16N bytes of state. Does it fit on 80 GB?
2. Enable bf16, checkpointing, gradient accumulation
3. Try FSDP alone (16N/K). Fits now?
4. Not enough → add TP within nodes
5. Still not enough → add PP across nodes
6. Verify batch size, comm overhead, utilization
- **Commit to a concrete answer** with numbers; don't just explore options
84
*Interview:* How to structure 'compare X, Y, Z' answers?
**Name the axes first, then fill them in. Use a mental table.**

- "They differ on 3 axes: what's split, communication, when to use."
- DP: splits data, 1 all-reduce/step, low bandwidth, model must fit
- TP: splits weights, 2 all-reduces/layer, high bandwidth, intra-node
- PP: splits layers, P-1 point-to-point sends, moderate bandwidth, cross-node
- Close with: "Combine them: TP inside, PP across, DP everywhere."
- *Structured answers score higher with the same content*
85
*Scaling:* 4 bottleneck categories when GPUs don't speed things up?
**Communication, pipeline bubbles, GPU underutilization, data I/O.**

1. Communication: all-reduce time > compute time. Slow interconnect or small model.
2. Bubble: (P-1)/(M+P-1) idle fraction. More stages → worse.
3. GPU underutilization: matrices too small (high TP degree, small batch).
4. Data I/O: GPUs starved for input. Fix: more dataloader workers, NVMe.
- **Amdahl's Law**: 10% serial → max 10× speedup, even with infinite GPUs
- Enumerate all 4, then diagnose
86
*Quick Math:* 3B + Adam + mixed precision. Memory? A100?
**48 GB of state → fits an A100 (80 GB) with 32 GB left over.**

- 3B × 16 bytes = 48 GB model state
- 80 GB − 48 GB = 32 GB for activations
- With checkpointing: fine for a moderate batch
- Long audio: might be tight → FSDP/TP
- Contrast: 7B × 16 = 112 GB → doesn't fit on one GPU
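The 16-bytes-per-parameter rule, as one reusable line:

```python
def adam_mixed_precision_state_gb(params_billion):
    # 4 (fp32 master) + 2 (bf16 weights) + 2 (bf16 grads) + 8 (Adam moments) = 16 B/param
    return params_billion * 16

assert adam_mixed_precision_state_gb(3) == 48    # fits: 80 - 48 = 32 GB to spare
assert adam_mixed_precision_state_gb(7) == 112   # doesn't fit one 80 GB A100
```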
87
*Mirelo:* Likely training setup for 1-3B audio model?
**FSDP across 8-32 GPUs, bf16, activation checkpointing.**

- 1.5B: 24 GB state → fits one A100, but audio = huge activations
- FSDP with K=8: 3 GB state per GPU → 77 GB free
- bf16: standard. Checkpointing: essential for 30s audio
- Gradient accumulation: batch control
- TP/PP: probably unnecessary at this scale
- Simple, PyTorch-native, sufficient
88
*Amdahl's Law:* Statement + diminishing returns example?
**Speedup = 1/(s + (1−s)/N). The serial fraction caps the maximum speedup.**

- 5% serial: 8 GPUs → ~6×, 64 GPUs → 15.4×, 256 GPUs → 18.6×, ∞ → 20×
- Efficiency drops: ~75% → 24% → 7%
- Serial parts: data loading, optimizer step, logging, pipeline bubbles, non-overlapped comm
- Pure DP has diminishing returns — serial overhead dominates at scale
- *Mentioning Amdahl shows fundamental understanding*
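The numbers above reproduce directly from the formula:

```python
def amdahl_speedup(serial_frac, n):
    """Amdahl's Law: speedup on n workers given a serial fraction."""
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

for n in (8, 64, 256):
    s = amdahl_speedup(0.05, n)
    print(f"{n:4d} GPUs: {s:5.1f}x speedup, {s / n:5.1%} efficiency")
# the n → ∞ limit with 5% serial is 1/0.05 = 20x
```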
89
*Summary Table:* All parallelism strategies at a glance?
**Each strategy splits a different dimension.**

| Strategy | Splits | Memory savings | Communication | Bandwidth need | When to use |
|---|---|---|---|---|---|
| **DDP** | data | none | 1 all-reduce/step | low | model fits on one GPU |
| **FSDP** | data + state | 16N/K | 3 ops/layer | medium | state too large |
| **TP** | weight matrices | /T (incl. activations) | 2 all-reduces/layer | very high | intra-node |
| **PP** | layers | N/P | P-1 point-to-point | medium | cross-node, huge model |

Plus the baselines: bf16 (2× speed), checkpointing (√L memory), grad accumulation (batch control).