What is the purpose of a feature map ϕ(x)?
To map data into a higher-dimensional space where a linear separator may exist.
Why can nonlinear problems become linearly separable after a feature map?
Because ϕ(x) adds nonlinear functions such as x², enabling linear boundaries in feature space.
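A minimal sketch of this idea with NumPy: 1-D points labelled by a threshold on |x| are not linearly separable on the line, but adding the x² feature makes a single threshold work. The data and threshold here are illustrative choices, not from the source.

```python
import numpy as np

# 1-D data: class +1 if |x| > 1.5, else -1 -- positives sit on BOTH ends,
# so no single threshold on x separates the classes
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(x) > 1.5, 1, -1)

# feature map phi(x) = (x, x^2): in feature space, a threshold on the
# x^2 coordinate (a linear boundary) separates the classes perfectly
phi = np.column_stack([x, x ** 2])
sep = np.where(phi[:, 1] > 1.5 ** 2, 1, -1)
print(np.array_equal(sep, y))
```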
Why do polynomial feature maps become expensive?
Their dimension grows combinatorially with degree; for degree s in d dimensions, size = Σ_{j=1}^s (d+j−1 choose j).
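The combinatorial growth can be checked directly with `math.comb`; the helper name `poly_feature_dim` is made up for this sketch.

```python
from math import comb

def poly_feature_dim(d, s):
    # number of monomial features of degree 1..s in d variables
    # (constant term excluded), per the formula above
    return sum(comb(d + j - 1, j) for j in range(1, s + 1))

print(poly_feature_dim(2, 2))    # x1, x2, x1^2, x1*x2, x2^2
print(poly_feature_dim(100, 5))  # already tens of millions of features
```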
What is the idea behind kernel methods?
Use infinite-dimensional feature maps indirectly via kernels without computing ϕ(x).
What is a Hilbert space?
A possibly infinite-dimensional inner product space (e.g., ℝ^a, ℓ², L²).
How is a linear model written in feature space?
fθ(x)=⟨θ,ϕ(x)⟩.
Why is training directly in infinite dimensions impossible?
θ has infinitely many coordinates; optimisation cannot be done in ℋ directly.
What does the representer theorem state?
The minimiser θ* lies in the span of the training feature vectors: θ* = Σ_i α_i ϕ(x_i).
What is the key consequence of the representer theorem?
An infinite-dimensional optimisation reduces to an n-dimensional optimisation in α.
What is a reproducing kernel?
A function k(x,z)=⟨ϕ(x),ϕ(z)⟩ giving feature-space inner products.
What does k(x,z) represent?
Similarity between x and z in feature space; higher k means more similar.
What is the kernel trick?
Replacing ⟨ϕ(x),ϕ(z)⟩ with k(x,z) so feature maps need not be computed.
What is the Gram matrix K?
K ∈ ℝ^{n×n} with entries K_ij = k(x_i, x_j).
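A sketch of building K with an RBF kernel (toy data, γ=1 chosen arbitrarily). For the RBF kernel, K is symmetric with ones on the diagonal, and a valid kernel always yields a positive semidefinite K.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, gamma=1.0):
    # K_ij = k(x_i, x_j)
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], gamma)
    return K

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gram_matrix(X)
print(np.allclose(K, K.T), np.allclose(np.diag(K), 1.0))
```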
What is the kernelised training objective?
argmin_α (1/n) Σ_j l(y_j,(Kα)_j) + (λ/2) αᵀ K α.
What is the kernelised predictor?
fα(x)=Σ_i α_i k(x, x_i).
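A sketch of the objective and predictor for squared loss, where the fit has a closed form (kernel ridge regression): minimising Σ_i (y_i − (Kα)_i)² + λ αᵀKα gives (K + λI)α = y. Note the regularisation constant is folded in slightly differently from the averaged objective above; the toy data, γ, and λ are illustrative choices.

```python
import numpy as np

def rbf(a, b, gamma=50.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

# toy 1-D regression data
X = np.linspace(0.0, 1.0, 8)
y = np.sin(2 * np.pi * X)

# Gram matrix and closed-form solve for alpha (one per datapoint)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    # kernelised predictor f(x) = sum_i alpha_i k(x, x_i)
    return sum(a * rbf(x, xi) for a, xi in zip(alpha, X))

print(abs(f(X[2]) - y[2]))  # small: near-interpolation of the training data
```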
How many parameters does a kernel model have?
One parameter α_i per training datapoint.
Why does infinite feature dimension not matter in kernel methods?
All computations require only kernel values k(x,z).
Give examples of valid kernels.
Constant k=c², linear k=xᵀz, polynomial k=(1+xᵀz)^a, Gaussian k=exp(−γ‖x−z‖²), exponential, Matérn.
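Evaluating a few of these kernels on a toy pair of points (the values of c, a, γ are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

const = 3.0 ** 2                              # constant kernel, c = 3
linear = x @ z                                # x^T z
poly = (1 + x @ z) ** 3                       # (1 + x^T z)^a, a = 3
gauss = np.exp(-0.5 * np.sum((x - z) ** 2))   # Gaussian, gamma = 0.5
print(const, linear, poly, gauss)
```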
What is an RBF (Gaussian) kernel?
k(x,z)=exp(−γ‖x−z‖²).
What does γ control in the RBF kernel?
The lengthscale of similarity: large γ gives very local similarity (wiggly, flexible boundaries); small γ gives broad similarity (smoother decision boundaries).
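The effect of γ can be seen by evaluating the kernel between two fixed points (unit distance apart, an arbitrary choice): similarity decays much faster in distance as γ grows.

```python
import numpy as np

def rbf(x, z, gamma):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

x, z = [0.0, 0.0], [1.0, 0.0]  # squared distance 1
sims = {gamma: rbf(x, z, gamma) for gamma in (0.1, 1.0, 10.0)}
for gamma, s in sims.items():
    print(gamma, s)  # large gamma -> similarity near 0 (local influence)
```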
How does RBF kernel classify difficult nonlinear data?
By creating highly flexible decision functions in infinite-dimensional feature space.
Why do polynomial kernels fail on the moons dataset?
Low-degree polynomial boundaries are too rigid and global to trace the two interlocking, highly curved clusters.
Which kernel works well on the moons dataset?
The RBF (Gaussian) kernel.
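A sketch of this comparison, assuming scikit-learn is available; `make_moons` and `SVC` are its real APIs, and the hyperparameters here are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# RBF kernel adapts to the interlocking half-moon shapes
rbf_acc = SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y)
poly_acc = SVC(kernel="poly", degree=3).fit(X, y).score(X, y)
print(rbf_acc, poly_acc)
```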
How does kernel-SVM classification work?
sign(Σ_i α_i k(x,x_i) − b), with α learned by minimising the hinge loss after applying the representer theorem.
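A sketch of this decision rule, assuming scikit-learn is available: a fitted `SVC` exposes `dual_coef_` (the α_i for the support vectors), `support_vectors_`, and `intercept_` (note scikit-learn folds the offset in with a plus sign), so the kernelised decision function can be recomputed by hand and checked against `decision_function`.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.1, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

def f(x):
    # sum_i alpha_i k(x, x_i) + b, summing over support vectors only
    # (alpha is zero for non-support vectors)
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k + clf.intercept_[0]

vals = np.array([f(x) for x in X])
print(np.allclose(vals, clf.decision_function(X)))  # classification is sign(f)
```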