What is the goal of learning/training in ML?
Choose a hypothesis class FΘ and loss l, and find θ̂ = argminθ R̂(θ) such that f_θ̂ predicts labels well.
What is empirical risk R̂(θ)?
R̂(θ)= (1/n) Σ_j l(y_j, f_θ(x_j)).
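The averaged-loss formula above can be sketched directly in code; the squared loss and linear model below are illustrative choices, not part of the definition:

```python
import numpy as np

def empirical_risk(loss, f, theta, X, y):
    """R_hat(theta) = (1/n) * sum_j loss(y_j, f_theta(x_j))."""
    preds = np.array([f(theta, x) for x in X])
    return np.mean([loss(yj, pj) for yj, pj in zip(y, preds)])

# Illustrative instances: squared loss and a linear model f_theta(x) = theta . x
sq_loss = lambda y, yhat: (y - yhat) ** 2
linear = lambda theta, x: theta @ x

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
theta = np.array([2.0, -1.0])
print(empirical_risk(sq_loss, linear, theta, X, y))  # exact fit → 0.0
```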
Why do we minimise empirical risk?
Because the true risk is unknown; empirical risk approximates it using training data.
What does gradient descent rely on?
Local information: the gradient ∇θF gives the steepest direction of increase, so −∇θF is steepest descent.
What is a directional derivative?
vᵀ∇θF for a (unit) direction v — the rate of change of F when moving infinitesimally along v.
What is a descent direction?
Any v such that vᵀ∇θF < 0, meaning moving along v reduces F.
What is the direction of steepest descent?
v = −∇θF / ||∇θF||.
What is the gradient descent update rule?
θ_{t+1} = θ_t − γ_t ∇θF(θ_t).
What does γ_t represent in gradient descent?
Learning rate (step size).
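The update rule and step size can be sketched on a toy quadratic F(θ) = ½ θᵀAθ; the matrix A and learning rate γ below are illustrative choices:

```python
import numpy as np

# Minimise F(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta
A = np.array([[2.0, 0.0], [0.0, 1.0]])
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
gamma = 0.1  # constant learning rate; stable here since gamma < 2 / lambda_max
for t in range(200):
    theta = theta - gamma * grad(theta)  # theta_{t+1} = theta_t - gamma * grad F(theta_t)

print(np.allclose(theta, 0.0, atol=1e-6))  # converges to the minimiser theta = 0
```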
What is gradient flow?
The continuous-time limit: dθ/dt = −∇θF(θ(t)).
How is GD related to gradient flow?
GD is a numerical approximation of gradient flow with step size γ_t.
Why does geometry of level sets matter for GD?
Because ∇F is orthogonal to level sets; elongated level sets cause slow convergence.
What does the condition number κ measure?
κ = λ_max / λ_min of (1/n)XᵀX; large κ means slow convergence.
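A minimal sketch of computing κ for the least-squares Gram matrix (1/n)XᵀX; the data matrix X below is an assumed example chosen to make the level sets elongated:

```python
import numpy as np

# Condition number of the scaled Gram matrix (1/n) X^T X (illustrative data)
X = np.array([[10.0, 0.0], [0.0, 1.0], [10.0, 0.0], [0.0, 1.0]])
n = X.shape[0]
H = X.T @ X / n
eigvals = np.linalg.eigvalsh(H)     # ascending eigenvalues
kappa = eigvals[-1] / eigvals[0]    # lambda_max / lambda_min
print(kappa)  # large kappa: the loss's level sets are elongated ellipses
```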
Why is GD not applicable to 0–1 loss?
Its gradient is zero almost everywhere.
What happens if ∇θF = 0?
GD gets stuck (local minimum, saddle point, plateau).
How does step size affect GD?
Too small: slow. Too large: divergence or oscillations.
What are common learning rate schedules?
Constant γ; diminishing γ_t → 0 (with Σ_t γ_t = ∞ so progress does not stall); exponential γ_t = γ₀β^t; line search.
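The schedules listed above can be written as plain functions of the step t; γ₀ and β below are assumed example values:

```python
# Common learning rate schedules as functions of the step index t (sketch)
gamma0, beta = 0.1, 0.9

constant    = lambda t: gamma0
diminishing = lambda t: gamma0 / (t + 1)    # gamma_t -> 0, but sum_t gamma_t diverges
exponential = lambda t: gamma0 * beta ** t  # gamma_t = gamma0 * beta^t

print(constant(10), diminishing(9), exponential(2))
```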
What is excess risk decomposition?
R(f̂) − R* = (estimation error) + (approximation error).
What is estimation error?
R(f̂) − R(f_{θ*}) — due to finite data.
What is approximation error?
R(f_{θ*}) − R* — due to model class limitations: even the best hypothesis in FΘ cannot reach the Bayes risk R*.
What is optimisation error?
R̂(f̂) − inf_θ R̂(f_θ) — due to not fully minimising empirical risk.
Why is full GD expensive?
Computing ∇θ F requires evaluating all n samples; too slow for large n.
What is stochastic gradient descent (SGD)?
θ_{t+1} = θ_t − γ_t ∇θ l(y_j, f_{θ_t}(x_j)), with the index j sampled uniformly at random at each step.
What is the key idea of SGD?
Replace full gradient with unbiased stochastic estimate.
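A minimal SGD sketch for least squares, where the per-sample gradient (θ·x_j − y_j)x_j is an unbiased estimate of the full gradient; the synthetic data, seed, and step size below are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless least-squares problem: y_j = true_theta . x_j
X = rng.normal(size=(200, 2))
true_theta = np.array([1.0, -2.0])
y = X @ true_theta

theta = np.zeros(2)
for t in range(5000):
    j = rng.integers(len(X))           # one uniformly sampled data point
    g = (theta @ X[j] - y[j]) * X[j]   # unbiased estimate of the full gradient
    theta -= 0.01 * g                  # theta_{t+1} = theta_t - gamma_t * g

print(np.allclose(theta, true_theta, atol=0.05))  # recovers true_theta
```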