What is the update rule for vanilla gradient descent?
w_{n+1} = w_n - η ∇J(w_n)
New params = current params - learning rate * Gradient
Gradient = vector of partial derivatives of the cost/loss function J with respect to the params w. It indicates the direction and magnitude of steepest ascent at the current point w_n, which is why it is subtracted.
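A minimal sketch of the update rule, on the hypothetical toy objective J(w) = 0.5 * ||w||², whose gradient is just w:

```python
import numpy as np

def grad_J(w):
    return w  # ∇J(w) for the toy objective J(w) = 0.5 * ||w||^2

w = np.array([4.0, -2.0])
eta = 0.1  # learning rate η
for _ in range(100):
    w = w - eta * grad_J(w)  # w_{n+1} = w_n - η ∇J(w_n)

# w shrinks by a factor (1 - η) each step toward the minimizer at 0
```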
Why does gradient descent converge more slowly in ill-conditioned optimization landscapes?
When the Hessian has a high condition number, i.e., a large ratio between its largest and smallest eigenvalues.
Steep valleys: extremely steep slopes in some directions (large eigenvalues)
Shallow, elongated valleys: very gentle slopes in other directions (small eigenvalues).
The gradient points mostly along the steep directions rather than toward the minimum → zig-zagging across the valley and slow progress along it.
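A toy illustration of this behavior on an ill-conditioned quadratic J(w) = 0.5 wᵀHw with condition number 100 (values chosen for illustration):

```python
import numpy as np

H = np.diag([100.0, 1.0])   # condition number κ = 100
eta = 0.018                 # stability requires η < 2/λ_max = 0.02
w = np.array([1.0, 1.0])
signs = []                  # sign of the steep coordinate, to see zig-zagging
for _ in range(50):
    signs.append(np.sign(w[0]))
    w = w - eta * H @ w     # ∇J(w) = H w

# Steep coordinate overshoots and flips sign every step (zig-zag),
# while the shallow coordinate decays slowly: (1 - 0.018)^50 ≈ 0.40.
```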
What is the difference between gradient descent, stochastic gradient descent (SGD), and mini-batch SGD?
GD: uses entire dataset each step; stable but slow.
SGD: uses 1 sample; fast but noisy.
Mini-batch: compromise; best for GPU training; stable + efficient.
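A sketch of the mini-batch loop on synthetic linear-regression data (all names illustrative); setting batch_size = len(X) recovers full-batch GD, and batch_size = 1 recovers pure SGD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                      # noiseless targets for the toy

def grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)  # MSE gradient on a batch

w = np.zeros(3)
eta, batch_size = 0.1, 32
for epoch in range(200):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # one mini-batch
        w = w - eta * grad(w, X[b], y[b])
```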
Why can gradient descent get stuck at saddle points in high-dimensional spaces?
Saddle points have zero gradient and usually have long flat plateaus around with near zero gradient.
Saddle points outnumber local minima in high dimensions.
Gradients vanish but curvature (Hessian) has mixed signs → optimization stalls.
What is the intuition behind momentum?
Adds an exponentially decaying moving average of past gradients → smooths noise and accelerates along consistent directions.
Accelerates training
Overcomes small local obstacles
Mitigates slow, zig-zagging convergence in ill-conditioned, non-uniform optimization landscapes.
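A sketch of the heavy-ball momentum update on a hypothetical ill-conditioned quadratic; the velocity v accumulates an exponentially decaying sum of past gradients:

```python
import numpy as np

def grad_J(w):
    return np.diag([100.0, 1.0]) @ w  # toy ill-conditioned objective

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
eta, beta = 0.01, 0.9   # learning rate and momentum coefficient
for _ in range(300):
    v = beta * v + grad_J(w)  # decaying accumulation of past gradients
    w = w - eta * v

# Consistent (shallow) directions accelerate; oscillations in the
# steep direction are damped instead of zig-zagging.
```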
Why is learning rate the most important hyperparameter in gradient descent?
Too small → slow training
Too large → divergence or oscillation
Controls stability, convergence speed, generalization
Why do optimizers like Adam and RMSProp outperform SGD on deep networks?
They adapt learning rates per parameter using estimates of first/second moments of gradients → better handling of sparse gradients and curvature.
They divide the LR by a moving average of squared gradients, giving smaller updates (smaller effective LR) to frequently updated params (large gradients) and larger updates (larger effective LR) to rarely updated ones (sparse/small gradients).
Adam combines RMSProp's adaptive rates with momentum, using an exponentially decaying average of past gradients. It accelerates convergence in consistent directions and dampens oscillations.
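A minimal sketch of one Adam step (standard formulation with the usual default hyperparameters, shown here on a toy quadratic):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # 1st moment: momentum on gradients
    v = b2 * v + (1 - b2) * g * g    # 2nd moment: moving avg of squared grads
    m_hat = m / (1 - b1 ** t)        # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)  # per-param adaptive LR
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    g = w  # gradient of the toy objective 0.5 * ||w||^2
    w, m, v = adam_step(w, g, m, v, t)
```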
When training on a GPU, why are large batch sizes often preferred?
Better hardware utilization
Higher throughput
More stable gradient estimates
But too large → poorer generalization + memory limits.
What is gradient clipping, and why is it used?
Clip gradient norm/magnitude to limit exploding gradients.
Common in:
RNNs
Transformers
Reinforcement learning
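A plain-numpy sketch of clipping by global norm, the scheme most frameworks implement (e.g. PyTorch's clip_grad_norm_); the direction is preserved, only the magnitude is rescaled:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    total = np.sqrt(sum(np.sum(g * g) for g in grads))  # norm across all params
    scale = min(1.0, max_norm / (total + 1e-12))        # no-op if already small
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```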
Why does mixed-precision training help gradient descent?
Faster matrix multiplications
Lower memory usage → larger batches
Uses loss-scaling to avoid underflow in gradients
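A toy demonstration of why loss scaling is needed: a gradient of 1e-8 underflows to zero in float16, but survives if scaled up by 2**16 before the cast and unscaled in float32 afterward:

```python
import numpy as np

tiny_grad = 1e-8                                # below float16's subnormal range
unscaled = np.float16(tiny_grad)                # underflows to 0.0
scaled = np.float16(tiny_grad * 2 ** 16)        # representable in fp16
recovered = np.float32(scaled) / 2 ** 16        # unscale in float32
```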
Why do transformers often require learning rate warmup?
At initialization, gradients are unstable; warmup gradually increases the learning rate → prevents divergence in early steps.
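One common concrete schedule is the one from the original Transformer paper (linear warmup followed by inverse square-root decay), sketched here:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    # linear ramp for warmup_steps, then decay as 1/sqrt(step)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# LR rises until step == warmup_steps, peaks there, then decays
```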
What is the role of gradient descent in fine-tuning LLMs with LoRA?
Gradient descent updates only the low-rank matrices A and B, keeping the main weights frozen. Reduces compute and prevents catastrophic forgetting.
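A sketch of a LoRA-style layer: the frozen weight W plus a trainable low-rank update scaled by alpha/r. B is zero-initialized, so at the start of fine-tuning the adapted layer matches the pretrained one exactly (shapes chosen for illustration):

```python
import numpy as np

d_out, d_in, r, alpha = 8, 16, 2, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init → no change at start

def forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))  # only A, B get gradients

x = rng.normal(size=d_in)
```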
Why does gradient variance matter for agentic systems with online learning or memory updates?
High variance leads to unstable updates in retrieval adapters or memory-writing modules → causes inconsistent agent behavior.
How does gradient descent behave differently in RLHF?
The loss gradient depends on reward-model outputs, which are themselves learned → non-stationary gradients → requires careful LR scheduling and regularization.
How does the Hessian eigenvalue spectrum affect GD convergence rates?
The largest eigenvalue (the Lipschitz constant L) limits the stable step size: η < 2/L.
The smallest eigenvalue determines how slowly the flattest directions converge, so the overall rate is governed by the condition number κ = λ_max/λ_min.
What is gradient noise scale, and why does it matter?
Ratio of gradient noise (variance across samples) to gradient signal; it predicts the critical batch size beyond which larger batches give diminishing returns, and influences generalization dynamics.
Why is second-order GD (Newton’s method) rarely used in deep learning?
Requires Hessian inversion → O(n³) cost + massive memory.
What assumptions guarantee convergence of batch gradient descent for convex functions?
Function is convex and differentiable
Gradient is L-Lipschitz continuous
Learning rate satisfies 0 < η < 2/L
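A quick numerical check of the η < 2/L condition on the toy convex objective J(w) = 0.5 * L * w², whose gradient is L-Lipschitz with constant L = 10:

```python
L_const = 10.0

def run_gd(eta, steps=100, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - eta * (L_const * w)  # ∇J(w) = L_const * w
    return w

converged = run_gd(eta=0.19)  # 0.19 < 2/L = 0.2 → |1 - ηL| = 0.9 < 1
diverged = run_gd(eta=0.21)   # 0.21 > 2/L → |1 - ηL| = 1.1 > 1, blows up
```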
Why does SGD not follow the true gradient direction?
Because it uses the gradient of a single sample (or small subset), which is an unbiased but noisy estimator of the full gradient.
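Unbiasedness can be checked directly: per-sample MSE gradients average exactly to the full-batch gradient (synthetic data, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)
w = rng.normal(size=4)

full_grad = X.T @ (X @ w - y) / len(y)                    # full-batch gradient
per_sample = np.array([x * (x @ w - yi) for x, yi in zip(X, y)])
mean_grad = per_sample.mean(axis=0)                       # E over samples
```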
Why does mini-batch SGD converge faster than pure SGD?
It reduces gradient variance while maintaining stochasticity → better direction estimates and more stable convergence.
What is the expected behavior of SGD at the end of optimization?
It oscillates around a minimum because of gradient noise; the step size must decay to ensure convergence.
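A toy comparison of a constant step versus a 1/t decay (in the spirit of the Robbins–Monro conditions) on a noisy quadratic, tracking the average |w| over the last 1000 steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(w):
    return w + rng.normal(scale=0.5)  # unbiased noisy gradient of 0.5 * w^2

def run(decay, steps=20000):
    w = 1.0
    tail = []
    for t in range(1, steps + 1):
        eta = 0.5 / t if decay else 0.5  # decaying vs constant step size
        w = w - eta * noisy_grad(w)
        if t > steps - 1000:
            tail.append(abs(w))
    return float(np.mean(tail))

w_decay = run(decay=True)    # settles near the minimum
w_const = run(decay=False)   # keeps oscillating at noise-floor amplitude
```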
How does GD differ from SGD in terms of convergence to global minima for convex problems?
GD → converges exactly to optimum.
SGD → converges in expectation, and oscillates unless learning rate decays.
In non-convex optimization, how does SGD help escape saddle points?
Gradient noise from random sampling provides stochastic perturbations that can push parameters away from points where gradient is near zero.
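A toy illustration on f(u, v) = u² - v², which has a saddle at the origin: plain GD started exactly on the stable manifold (v = 0) converges to the saddle, while a small amount of gradient noise kicks the iterate off it and lets it escape (the toy objective is unbounded below, so |v| then keeps growing):

```python
import numpy as np

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])  # ∇f for f(u, v) = u^2 - v^2

rng = np.random.default_rng(0)

def run(noise, steps=100, eta=0.1):
    p = np.array([1.0, 0.0])  # v = 0: exactly on the stable manifold
    for _ in range(steps):
        g = grad(p)
        if noise:
            g = g + rng.normal(scale=0.01, size=2)  # stochastic perturbation
        p = p - eta * g
    return p

stuck = run(noise=False)    # converges to the saddle at the origin
escaped = run(noise=True)   # noise pushes v off zero; |v| grows
```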
Why is batch gradient descent rarely used in deep networks?
Requires full dataset per step → huge memory and time cost
Offers no stochasticity → easily stuck in saddle points or sharp minima