What is the update rule for vanilla gradient descent?
w_{n+1} = w_n - η ∇J(w_n)
New params = current params - learning rate * Gradient
Gradient = vector of partial derivatives of the cost/loss function J with respect to the params w. It indicates the direction and magnitude of steepest ascent at the current point w_n, which is why it is subtracted.
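A minimal sketch of the update rule, on the hypothetical toy objective J(w) = 0.5 * ||w||², whose gradient is just w:

```python
import numpy as np

def grad_J(w):
    return w  # ∇J(w) for the toy objective J(w) = 0.5 * ||w||^2

w = np.array([4.0, -2.0])
eta = 0.1  # learning rate η
for _ in range(100):
    w = w - eta * grad_J(w)  # w_{n+1} = w_n - η ∇J(w_n)

# w shrinks by a factor (1 - η) each step toward the minimizer at 0
```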
Why does gradient descent converge more slowly in ill-conditioned optimization landscapes?
When the Hessian has a high condition number, i.e., a large ratio between its largest and smallest eigenvalues.
Steep valleys: extremely steep slopes in some directions (large eigenvalues)
Shallow, elongated valleys: very gentle slopes in other directions (small eigenvalues).
The gradient points mostly along the steep directions rather than toward the minimum → zig-zagging across the valley and slow progress along it.
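A toy illustration of this behavior on an ill-conditioned quadratic J(w) = 0.5 wᵀHw with condition number 100 (values chosen for illustration):

```python
import numpy as np

H = np.diag([100.0, 1.0])   # condition number κ = 100
eta = 0.018                 # stability requires η < 2/λ_max = 0.02
w = np.array([1.0, 1.0])
signs = []                  # sign of the steep coordinate, to see zig-zagging
for _ in range(50):
    signs.append(np.sign(w[0]))
    w = w - eta * H @ w     # ∇J(w) = H w

# Steep coordinate overshoots and flips sign every step (zig-zag),
# while the shallow coordinate decays slowly: (1 - 0.018)^50 ≈ 0.40.
```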
What is the difference between gradient descent, stochastic gradient descent (SGD), and mini-batch SGD?
GD: uses entire dataset each step; stable but slow.
SGD: uses 1 sample; fast but noisy.
Mini-batch: compromise; best for GPU training; stable + efficient.
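A sketch of the mini-batch loop on synthetic linear-regression data (all names illustrative); setting batch_size = len(X) recovers full-batch GD, and batch_size = 1 recovers pure SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                      # noiseless targets for the toy

def grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)  # MSE gradient on a batch

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # one mini-batch
        w = w - eta * grad(w, X[b], y[b])
```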
Why can gradient descent get stuck at saddle points in high-dimensional spaces?
Saddle points have zero gradient and usually have long flat plateaus around with near zero gradient.
Saddle points outnumber local minima in high dimensions.
Gradients vanish but curvature (Hessian) has mixed signs → optimization stalls.
What is the intuition behind momentum?
Adds an exponentially decaying moving average of past gradients → smooths noise and accelerates along consistent directions.
Accelerates training
Overcomes small local obstacles
Mitigates slow, zig-zagging convergence in ill-conditioned, non-uniform optimization landscapes.
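A sketch of the heavy-ball momentum update on a hypothetical ill-conditioned quadratic; the velocity v accumulates an exponentially decaying sum of past gradients:

```python
import numpy as np

def grad_J(w):
    return np.diag([100.0, 1.0]) @ w  # toy ill-conditioned objective

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
eta, beta = 0.01, 0.9   # learning rate and momentum coefficient
for _ in range(300):
    v = beta * v + grad_J(w)  # decaying accumulation of past gradients
    w = w - eta * v

# Consistent (shallow) directions accelerate; oscillations in the
# steep direction are damped instead of zig-zagging.
```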
Why is learning rate the most important hyperparameter in gradient descent?
Too small → slow training
Too large → divergence or oscillation
Controls stability, convergence speed, generalization
Why do optimizers like Adam and RMSProp outperform SGD on deep networks?
They adapt learning rates per parameter using estimates of first/second moments of gradients → better handling of sparse gradients and curvature.
They divide the LR by a moving average of squared gradients, giving smaller updates (smaller effective LR) to frequently updated params (large gradients) and larger updates (larger effective LR) to rarely updated ones (sparse/small gradients).
Adam combines RMSProp's adaptive rates with momentum, using an exponentially decaying average of past gradients. It accelerates convergence in consistent directions and dampens oscillations.
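A minimal sketch of one Adam step (standard formulation with the usual default hyperparameters, shown here on a toy quadratic):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # 1st moment: momentum on gradients
    v = b2 * v + (1 - b2) * g * g    # 2nd moment: moving avg of squared grads
    m_hat = m / (1 - b1 ** t)        # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-param adaptive LR
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    g = w  # gradient of the toy objective 0.5 * ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
```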
When training on a GPU, why are large batch sizes often preferred?
Better hardware utilization
Higher throughput
More stable gradient estimates
But too large → poorer generalization + memory limits.
What is gradient clipping, and why is it used?
Clip gradient norm/magnitude to limit exploding gradients.
Common in:
RNNs
Transformers
Reinforcement learning
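A plain-numpy sketch of clipping by global norm, the scheme most frameworks implement (e.g. PyTorch's clip_grad_norm_); the direction is preserved, only the magnitude is rescaled:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    total = np.sqrt(sum(np.sum(g * g) for g in grads))  # norm across all params
    scale = min(1.0, max_norm / (total + 1e-12))        # no-op if already small
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```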
Why does mixed-precision training help gradient descent?
Faster matrix multiplications
Lower memory usage → larger batches
Uses loss-scaling to avoid underflow in gradients
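A toy demonstration of why loss scaling is needed: a gradient of 1e-8 underflows to zero in float16, but survives if scaled up by 2**16 before the cast and unscaled in float32 afterward:

```python
import numpy as np

tiny_grad = 1e-8                                # below float16's subnormal range
unscaled = np.float16(tiny_grad)                # underflows to 0.0
scaled = np.float16(tiny_grad * 2 ** 16)        # representable in fp16
recovered = np.float32(scaled) / 2 ** 16        # unscale in float32
```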
Why do transformers often require learning rate warmup?
At initialization, gradients are unstable; warmup gradually increases the learning rate → prevents divergence in early steps.
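One common concrete schedule is the one from the original Transformer paper (linear warmup followed by inverse square-root decay), sketched here:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    # linear ramp for warmup_steps, then decay as 1/sqrt(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# LR rises until step == warmup_steps, peaks there, then decays
```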
What is the role of gradient descent in fine-tuning LLMs with LoRA?
Gradient descent updates only the low-rank matrices A and B, keeping the main weights frozen. Reduces compute and prevents catastrophic forgetting.
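A sketch of a LoRA-style layer: the frozen weight W plus a trainable low-rank update scaled by alpha/r. B is zero-initialized, so at the start of fine-tuning the adapted layer matches the pretrained one exactly (shapes chosen for illustration):

```python
import numpy as np

d_out, d_in, r, alpha = 8, 16, 2, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init → no change at start

def forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))  # only A, B get gradients

x = rng.normal(size=d_in)
```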
Why does gradient variance matter for agentic systems with online learning or memory updates?
High variance leads to unstable updates in retrieval adapters or memory-writing modules → causes inconsistent agent behavior.
How does gradient descent behave differently in RLHF?
The loss gradient depends on reward-model outputs, which are themselves learned → non-stationary gradients → requires careful LR scheduling and regularization.
How does the Hessian eigenvalue spectrum affect GD convergence rates?
The largest eigenvalue (the Lipschitz constant L) limits the stable step size: η < 2/L.
The smallest eigenvalue determines how slowly the flattest directions converge, so the overall rate is governed by the condition number κ = λ_max/λ_min.
What is gradient noise scale, and why does it matter?
Ratio of gradient noise (variance across samples) to gradient signal; it predicts the critical batch size beyond which larger batches give diminishing returns, and influences generalization dynamics.
Why is second-order GD (Newton’s method) rarely used in deep learning?
Requires Hessian inversion → O(n³) cost + massive memory.
What assumptions guarantee convergence of batch gradient descent for convex functions?
Function is convex and differentiable
Gradient is L-Lipschitz continuous
Learning rate satisfies 0 < η < 2/L
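A quick numerical check of the η < 2/L condition on the toy convex objective J(w) = 0.5 * L * w², whose gradient is L-Lipschitz with constant L = 10:

```python
L_const = 10.0

def run_gd(eta, steps=100, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - eta * (L_const * w)  # ∇J(w) = L_const * w
    return w

converged = run_gd(eta=0.19)  # 0.19 < 2/L = 0.2 → |1 - ηL| = 0.9 < 1
diverged = run_gd(eta=0.21)   # 0.21 > 2/L → |1 - ηL| = 1.1 > 1, blows up
```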
Why does SGD not follow the true gradient direction?
Because it uses the gradient of a single sample (or small subset), which is an unbiased but noisy estimator of the full gradient.
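Unbiasedness can be checked directly: per-sample MSE gradients average exactly to the full-batch gradient (synthetic data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
w = rng.normal(size=4)

full_grad = X.T @ (X @ w - y) / len(y)                    # full-batch gradient
per_sample = np.array([x * (x @ w - yi) for x, yi in zip(X, y)])
mean_grad = per_sample.mean(axis=0)                       # E over samples
```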
Why does mini-batch SGD converge faster than pure SGD?
It reduces gradient variance while maintaining stochasticity → better direction estimates and more stable convergence.
What is the expected behavior of SGD at the end of optimization?
It oscillates around a minimum because of gradient noise; the step size must decay to ensure convergence.
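A toy comparison of a constant step versus a 1/t decay (in the spirit of the Robbins–Monro conditions) on a noisy quadratic, tracking the average |w| over the last 1000 steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    return w + rng.normal(scale=0.5)  # unbiased noisy gradient of 0.5 * w^2

def run(decay, steps=20000):
    w = 1.0
    tail = []
    for t in range(1, steps + 1):
        eta = 0.5 / t if decay else 0.5  # decaying vs constant step size
        w = w - eta * noisy_grad(w)
        if t > steps - 1000:
            tail.append(abs(w))
    return float(np.mean(tail))

w_decay = run(decay=True)    # settles near the minimum
w_const = run(decay=False)   # keeps oscillating at noise-floor amplitude
```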
How does GD differ from SGD in terms of convergence to global minima for convex problems?
GD → converges exactly to optimum.
SGD → converges in expectation, and oscillates unless learning rate decays.
In non-convex optimization, how does SGD help escape saddle points?
Gradient noise from random sampling provides stochastic perturbations that can push parameters away from points where gradient is near zero.
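A toy illustration on f(u, v) = u² - v², which has a saddle at the origin: plain GD started exactly on the stable manifold (v = 0) converges to the saddle, while a small amount of gradient noise kicks the iterate off it and lets it escape (the toy objective is unbounded below, so |v| then keeps growing):

```python
import numpy as np

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])  # ∇f for f(u, v) = u^2 - v^2

rng = np.random.default_rng(0)

def run(noise, steps=100, eta=0.1):
    p = np.array([1.0, 0.0])  # v = 0: exactly on the stable manifold
    for _ in range(steps):
        g = grad(p)
        if noise:
            g = g + rng.normal(scale=0.01, size=2)  # stochastic perturbation
        p = p - eta * g
    return p

stuck = run(noise=False)    # converges to the saddle at the origin
escaped = run(noise=True)   # noise pushes v off zero; |v| grows
```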
Why is batch gradient descent rarely used in deep networks?
Requires full dataset per step → huge memory and time cost
Offers no stochasticity → easily stuck in saddle points or sharp minima