What is the goal of learning/training in ML?
Choose a hypothesis class FΘ and loss l, and find θ̂ = argminθ R̂(θ) such that f_θ̂ predicts labels well.
What is empirical risk R̂(θ)?
R̂(θ)= (1/n) Σ_j l(y_j, f_θ(x_j)).
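The averaged-loss formula above can be sketched directly in code; the squared loss and linear model below are illustrative choices, not part of the definition:

```python
import numpy as np

def empirical_risk(loss, f, theta, X, y):
    """R_hat(theta) = (1/n) * sum_j loss(y_j, f_theta(x_j))."""
    preds = np.array([f(theta, x) for x in X])
    return np.mean([loss(yj, pj) for yj, pj in zip(y, preds)])

# Illustrative instances: squared loss and a linear model f_theta(x) = theta . x
sq_loss = lambda y, yhat: (y - yhat) ** 2
linear = lambda theta, x: theta @ x

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
theta = np.array([2.0, -1.0])
print(empirical_risk(sq_loss, linear, theta, X, y))  # exact fit → 0.0
```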
Why do we minimise empirical risk?
Because the true risk is unknown; empirical risk approximates it using training data.
What does gradient descent rely on?
Local information: the gradient ∇θF gives the steepest direction of increase, so −∇θF is steepest descent.
What is a directional derivative?
vᵀ∇θF for a (unit) direction v — the rate of change of F when moving infinitesimally along v.
What is a descent direction?
Any v such that vᵀ∇θF < 0, meaning moving along v reduces F.
What is the direction of steepest descent?
v = −∇θF / ||∇θF||.
What is the gradient descent update rule?
θ_{t+1} = θ_t − γ_t ∇θF(θ_t).
What does γ_t represent in gradient descent?
Learning rate (step size).
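The update rule and step size can be sketched on a toy quadratic F(θ) = ½ θᵀAθ; the matrix A and learning rate γ below are illustrative choices:

```python
import numpy as np

# Minimise F(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta
A = np.array([[2.0, 0.0], [0.0, 1.0]])
grad = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
gamma = 0.1  # constant learning rate; stable here since gamma < 2 / lambda_max
for t in range(200):
    theta = theta - gamma * grad(theta)  # theta_{t+1} = theta_t - gamma * grad F(theta_t)

print(np.allclose(theta, 0.0, atol=1e-6))  # converges to the minimiser theta = 0
```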
What is gradient flow?
The continuous-time limit: dθ/dt = −∇θF(θ(t)).
How is GD related to gradient flow?
GD is a numerical approximation of gradient flow with step size γ_t.
Why does geometry of level sets matter for GD?
Because ∇F is orthogonal to level sets; elongated level sets cause slow convergence.
What does the condition number κ measure?
κ = λ_max / λ_min of (1/n)XᵀX; large κ means slow convergence.
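A minimal sketch of computing κ for the least-squares Gram matrix (1/n)XᵀX; the data matrix X below is an assumed example chosen to make the level sets elongated:

```python
import numpy as np

# Condition number of the scaled Gram matrix (1/n) X^T X (illustrative data)
X = np.array([[10.0, 0.0], [0.0, 1.0], [10.0, 0.0], [0.0, 1.0]])
n = X.shape[0]
H = X.T @ X / n
eigvals = np.linalg.eigvalsh(H)     # ascending eigenvalues
kappa = eigvals[-1] / eigvals[0]    # lambda_max / lambda_min
print(kappa)  # large kappa: the loss's level sets are elongated ellipses
```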
Why is GD not applicable to 0–1 loss?
Its gradient is zero almost everywhere.
What happens if ∇θF = 0?
GD gets stuck (local minimum, saddle point, plateau).
How does step size affect GD?
Too small: slow. Too large: divergence or oscillations.
What are common learning rate schedules?
Constant γ; diminishing γ_t → 0 (with Σ_t γ_t = ∞ so progress does not stall); exponential γ_t = γ₀β^t; line search.
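The schedules listed above can be written as plain functions of the step t; γ₀ and β below are assumed example values:

```python
# Common learning rate schedules as functions of the step index t (sketch)
gamma0, beta = 0.1, 0.9

constant    = lambda t: gamma0
diminishing = lambda t: gamma0 / (t + 1)    # gamma_t -> 0, but sum_t gamma_t diverges
exponential = lambda t: gamma0 * beta ** t  # gamma_t = gamma0 * beta^t

print(constant(10), diminishing(9), exponential(2))
```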
What is excess risk decomposition?
R(f̂) − R* = (estimation error) + (approximation error).
What is estimation error?
R(f̂) − R(f_{θ*}) — due to finite data.
What is approximation error?
R(f_{θ*}) − R* — due to model class limitations: even the best hypothesis in FΘ cannot reach the Bayes risk R*.
What is optimisation error?
R̂(f̂) − inf_θ R̂(f_θ) — due to not fully minimising empirical risk.
Why is full GD expensive?
Computing ∇θ F requires evaluating all n samples; too slow for large n.
What is stochastic gradient descent (SGD)?
θ_{t+1} = θ_t − γ_t ∇θ l(y_j, f_{θ_t}(x_j)), with the index j sampled uniformly at random at each step.
What is the key idea of SGD?
Replace full gradient with unbiased stochastic estimate.
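A minimal SGD sketch for least squares, where the per-sample gradient (θ·x_j − y_j)x_j is an unbiased estimate of the full gradient; the synthetic data, seed, and step size below are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noiseless least-squares problem: y_j = true_theta . x_j
X = rng.normal(size=(200, 2))
true_theta = np.array([1.0, -2.0])
y = X @ true_theta

theta = np.zeros(2)
for t in range(5000):
    j = rng.integers(len(X))           # one uniformly sampled data point
    g = (theta @ X[j] - y[j]) * X[j]   # unbiased estimate of the full gradient
    theta -= 0.01 * g                  # theta_{t+1} = theta_t - gamma_t * g

print(np.allclose(theta, true_theta, atol=0.05))  # recovers true_theta
```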