What is the purpose of a feature map ϕ(x)?
To map data into a higher-dimensional space where a linear separator may exist.
Why can nonlinear problems become linearly separable after a feature map?
Because ϕ(x) adds nonlinear functions such as x², enabling linear boundaries in feature space.
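A minimal sketch of this idea with NumPy: 1-D points labelled by a threshold on |x| are not linearly separable on the line, but adding the x² feature makes a single threshold work. The data and threshold here are illustrative choices, not from the source.

```python
import numpy as np

# 1-D data: class +1 if |x| > 1.5, else -1 -- positives sit on BOTH ends,
# so no single threshold on x separates the classes
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(x) > 1.5, 1, -1)

# feature map phi(x) = (x, x^2): in feature space, a threshold on the
# x^2 coordinate (a linear boundary) separates the classes perfectly
phi = np.column_stack([x, x ** 2])
sep = np.where(phi[:, 1] > 1.5 ** 2, 1, -1)
print(np.array_equal(sep, y))
```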
Why do polynomial feature maps become expensive?
Their dimension grows combinatorially with degree; for degree s in d dimensions, size = Σ_{j=1}^s (d+j−1 choose j).
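The combinatorial growth can be checked directly with `math.comb`; the helper name `poly_feature_dim` is made up for this sketch.

```python
from math import comb

def poly_feature_dim(d, s):
    # number of monomial features of degree 1..s in d variables
    # (constant term excluded), per the formula above
    return sum(comb(d + j - 1, j) for j in range(1, s + 1))

print(poly_feature_dim(2, 2))    # x1, x2, x1^2, x1*x2, x2^2
print(poly_feature_dim(100, 5))  # already tens of millions of features
```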
What is the idea behind kernel methods?
Use infinite-dimensional feature maps indirectly via kernels without computing ϕ(x).
What is a Hilbert space?
A possibly infinite-dimensional inner product space (e.g., ℝ^a, ℓ², L²).
How is a linear model written in feature space?
fθ(x)=⟨θ,ϕ(x)⟩.
Why is training directly in infinite dimensions impossible?
θ has infinitely many coordinates; optimisation cannot be done in ℋ directly.
What does the representer theorem state?
The minimiser θ* lies in the span of the training feature vectors: θ* = Σ_i α_i ϕ(x_i).
What is the key consequence of the representer theorem?
An infinite-dimensional optimisation reduces to an n-dimensional optimisation in α.
What is a reproducing kernel?
A function k(x,z)=⟨ϕ(x),ϕ(z)⟩ giving feature-space inner products.
What does k(x,z) represent?
Similarity between x and z in feature space; higher k means more similar.
What is the kernel trick?
Replacing ⟨ϕ(x),ϕ(z)⟩ with k(x,z) so feature maps need not be computed.
What is the Gram matrix K?
K ∈ ℝ^{n×n} with entries K_ij = k(x_i, x_j).
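A sketch of building K with an RBF kernel (toy data, γ=1 chosen arbitrarily). For the RBF kernel, K is symmetric with ones on the diagonal, and a valid kernel always yields a positive semidefinite K.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, gamma=1.0):
    # K_ij = k(x_i, x_j)
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
    return K

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gram_matrix(X)
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```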
What is the kernelised training objective?
argmin_α (1/n) Σ_j l(y_j,(Kα)_j) + (λ/2) αᵀ K α.
What is the kernelised predictor?
fα(x)=Σ_i α_i k(x, x_i).
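A sketch of the objective and predictor for squared loss, where the fit has a closed form (kernel ridge regression): minimising Σ_i (y_i − (Kα)_i)² + λ αᵀKα gives (K + λI)α = y. Note the regularisation constant is folded in slightly differently from the averaged objective above; the toy data, γ, and λ are illustrative choices.

```python
import numpy as np

def rbf(a, b, gamma=50.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# toy 1-D regression data
X = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * X)

# Gram matrix and closed-form solve for alpha (one per datapoint)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    # kernelised predictor f(x) = sum_i alpha_i k(x, x_i)
    return sum(a * rbf(x, xi) for a, xi in zip(alpha, X))

print(abs(f(X[2]) - y[2]))  # small: near-interpolation of the training data
```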
How many parameters does a kernel model have?
One parameter α_i per training datapoint.
Why does infinite feature dimension not matter in kernel methods?
All computations require only kernel values k(x,z).
Give examples of valid kernels.
Constant k=c², linear k=xᵀz, polynomial k=(1+xᵀz)^a, Gaussian k=exp(−γ‖x−z‖²), exponential, Matérn.
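Evaluating a few of these kernels on a toy pair of points (the values of c, a, γ are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

const = 3.0 ** 2                              # constant kernel, c = 3
linear = x @ z                                # x^T z
poly = (1 + x @ z) ** 3                       # (1 + x^T z)^a, a = 3
gauss = np.exp(-0.5 * np.sum((x - z) ** 2))   # Gaussian, gamma = 0.5
print(const, linear, poly, gauss)
```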
What is an RBF (Gaussian) kernel?
k(x,z)=exp(−γ‖x−z‖²).
What does γ control in the RBF kernel?
The lengthscale of similarity: large γ gives very local similarity (wiggly, flexible boundaries); small γ gives broad similarity (smoother decision boundaries).
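The effect of γ can be seen by evaluating the kernel between two fixed points (unit distance apart, an arbitrary choice): similarity decays much faster in distance as γ grows.

```python
import numpy as np

def rbf(x, z, gamma):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

x, z = [0.0, 0.0], [1.0, 0.0]  # squared distance 1
sims = {gamma: rbf(x, z, gamma) for gamma in (0.1, 1.0, 10.0)}
for gamma, s in sims.items():
    print(gamma, s)  # large gamma -> similarity near 0 (local influence)
```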
How does RBF kernel classify difficult nonlinear data?
By creating highly flexible decision functions in infinite-dimensional feature space.
Why do polynomial kernels fail on the moons dataset?
Low-degree polynomial boundaries are too rigid and global to trace the two interlocking, highly curved clusters.
Which kernel works well on the moons dataset?
The RBF (Gaussian) kernel.
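A sketch of this comparison, assuming scikit-learn is available; `make_moons` and `SVC` are its real APIs, and the hyperparameters here are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# RBF kernel adapts to the interlocking half-moon shapes
rbf_acc = SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y)
poly_acc = SVC(kernel="poly", degree=3).fit(X, y).score(X, y)
print(rbf_acc, poly_acc)
```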
How does kernel-SVM classification work?
sign(Σ_i α_i k(x,x_i) − b), with α learned by minimising the hinge loss after applying the representer theorem.
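A sketch of this decision rule, assuming scikit-learn is available: a fitted `SVC` exposes `dual_coef_` (the α_i for the support vectors), `support_vectors_`, and `intercept_` (note scikit-learn folds the offset in with a plus sign), so the kernelised decision function can be recomputed by hand and checked against `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.1, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def f(x):
    # sum_i alpha_i k(x, x_i) + b, summing over support vectors only
    # (alpha is zero for non-support vectors)
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

vals = np.array([f(x) for x in X])
print(np.allclose(vals, clf.decision_function(X)))  # classification is sign(f)
```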