What is the Bayes optimal classifier in binary classification?
The predictor f*(x) = 1[ P(Y=1|X=x) ≥ 1/2 ] that minimises expected risk under 0-1 loss; a general loss shifts the threshold away from 1/2.
What is the likelihood ratio used in the optimal rule?
ℒ(x) = ρ(x|Y=1) / ρ(x|Y=0).
What form does the Bayes classifier take using the likelihood ratio?
f*(x) = 1[ ℒ(x) ≥ π₀(l01−l00) / (π₁(l10−l11)) ].
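The likelihood-ratio form of the Bayes classifier can be sketched in code. Everything concrete here is an assumption for illustration: 1-D Gaussian class conditionals N(0,1) and N(2,1), priors 0.7/0.3, and loss values l_yf (loss of predicting f when the true class is y) with an asymmetric false-negative cost.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed example: rho(x|Y=0) = N(0,1), rho(x|Y=1) = N(2,1).
pi0, pi1 = 0.7, 0.3                       # priors P(Y=0), P(Y=1)
l00, l01, l10, l11 = 0.0, 1.0, 5.0, 0.0   # l_yf: loss of predicting f when truth is y

# Bayes-optimal LRT threshold from the formula above.
eta = (pi0 * (l01 - l00)) / (pi1 * (l10 - l11))

def bayes_classifier(x):
    lr = gauss_pdf(x, 2.0, 1.0) / gauss_pdf(x, 0.0, 1.0)  # likelihood ratio L(x)
    return (lr >= eta).astype(int)

print(bayes_classifier(np.array([-1.0, 1.0, 3.0])))  # -> [0 1 1]
```

For these Gaussians the likelihood ratio is exp(2x − 2), so the rule reduces to thresholding x itself; the costly false negatives (l10 = 5) pull the threshold toward predicting 1.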
What is a likelihood ratio test (LRT)?
A classifier of the form fη(x)=1[ℒ(x)≥η] with threshold η.
What does the Neyman–Pearson lemma state?
If class-conditional densities are continuous, the classifier that maximises TPR subject to FPR ≤ α is an LRT with threshold chosen so FPR=α.
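In practice the Neyman–Pearson threshold can be set empirically: pick η as the (1−α)-quantile of the scores on class-0 samples, so the empirical FPR is α. A small sketch under assumed toy data, where Gaussian scores stand in for the likelihood ratio ℒ(x) (thresholding is unchanged by any monotone transform of ℒ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy scores playing the role of L(x) on each class.
neg_scores = rng.normal(0.0, 1.0, size=10_000)   # scores on Y=0 samples
pos_scores = rng.normal(2.0, 1.0, size=10_000)   # scores on Y=1 samples

alpha = 0.05
eta = np.quantile(neg_scores, 1.0 - alpha)  # threshold so empirical FPR ~= alpha

fpr = np.mean(neg_scores >= eta)
tpr = np.mean(pos_scores >= eta)
print(f"FPR={fpr:.3f}, TPR={tpr:.3f}")
```

The resulting classifier maximises TPR among all tests whose FPR does not exceed α, which is the Neyman–Pearson guarantee.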
What are Type I and Type II errors?
Type I: false positive (predict 1 when Y=0). Type II: false negative (predict 0 when Y=1).
What is True Positive Rate (TPR)?
P(f(X)=1 | Y=1).
What is False Negative Rate (FNR)?
P(f(X)=0 | Y=1).
What is False Positive Rate (FPR)?
P(f(X)=1 | Y=0).
What is True Negative Rate (TNR)?
P(f(X)=0 | Y=0).
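The four rates above are conditional frequencies over the true class. A minimal sketch (the function name and toy labels are my own):

```python
import numpy as np

def rates(y_true, y_pred):
    """Empirical TPR, FNR, FPR, TNR as frequencies conditioned on the true class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = y_true == 1, y_true == 0
    tpr = np.mean(y_pred[pos] == 1)   # P(f(X)=1 | Y=1)
    fnr = np.mean(y_pred[pos] == 0)   # P(f(X)=0 | Y=1)
    fpr = np.mean(y_pred[neg] == 1)   # P(f(X)=1 | Y=0)
    tnr = np.mean(y_pred[neg] == 0)   # P(f(X)=0 | Y=0)
    return tpr, fnr, fpr, tnr

print(rates([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))  # (2/3, 1/3, 1/2, 1/2)
```

Note TPR + FNR = 1 and FPR + TNR = 1, since each pair conditions on the same class.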
How can risk be decomposed using FPR and TPR?
R[f] = α·FPR − β·TPR + γ, where the constants α, β ≥ 0 and γ are determined by the class priors and loss values (e.g. under 0-1 loss, α = π₀, β = π₁, γ = π₁).
What is an alternative supervised learning goal beyond minimising risk?
Maximising TPR subject to constraint FPR ≤ α.
What distinguishes discriminative from generative models?
Discriminative models learn the predictor f (or ρ(y|x)) directly; generative models model the joint density ρ(x,y), typically as ρ(y) and ρ(x|y).
What do generative models learn?
ρ(x|y), ρ(y) allowing computation of ρ(y|x) via Bayes’ rule.
What does Linear Discriminant Analysis assume?
Classes have Gaussian class-conditional distributions with shared covariance Σ and different means μᵢ.
What is Quadratic Discriminant Analysis?
A generative model like LDA but each class has its own covariance Σᵢ.
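Both LDA and QDA fit Gaussian class conditionals by plug-in maximum likelihood and classify via the discriminant log πₖ − ½log|Σₖ| − ½(x−μₖ)ᵀΣₖ⁻¹(x−μₖ); the only difference is whether the covariance is pooled. A sketch on assumed toy 2-D data (function name and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed toy data: class 0 around (0,0), class 1 around (2,2), unit noise.
X0 = rng.normal([0, 0], 1.0, size=(200, 2))
X1 = rng.normal([2, 2], 1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

def fit_gaussian_da(X, y, shared_cov=True):
    """LDA (shared_cov=True) or QDA (shared_cov=False) via plug-in MLE."""
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}
    means = {k: X[y == k].mean(axis=0) for k in classes}
    if shared_cov:  # LDA: pool the per-class covariances, weighted by class size
        pooled = sum(np.cov(X[y == k].T, bias=True) * np.sum(y == k)
                     for k in classes) / len(y)
        covs = {k: pooled for k in classes}
    else:           # QDA: each class keeps its own covariance
        covs = {k: np.cov(X[y == k].T, bias=True) for k in classes}

    def predict(Xnew):
        scores = []
        for k in classes:
            inv = np.linalg.inv(covs[k])
            d = Xnew - means[k]
            g = (np.log(priors[k]) - 0.5 * np.log(np.linalg.det(covs[k]))
                 - 0.5 * np.einsum('ij,jk,ik->i', d, inv, d))  # quadratic form per row
            scores.append(g)
        return classes[np.argmax(scores, axis=0)]
    return predict

lda = fit_gaussian_da(X, y, shared_cov=True)
qda = fit_gaussian_da(X, y, shared_cov=False)
print(lda(np.array([[0.0, 0.0], [2.0, 2.0]])))  # -> [0 1]
```

With a shared covariance the quadratic terms cancel and the LDA boundary is linear in x; QDA's boundary is quadratic.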
What assumption does Naive Bayes make?
Conditional independence of features xⱼ given class y.
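Under conditional independence the class-conditional density factorises, ρ(x|y) = Πⱼ ρ(xⱼ|y), so each feature's distribution is estimated separately. A Bernoulli naive Bayes sketch on assumed binary toy features, with Laplace smoothing (all names and data are illustrative):

```python
import numpy as np

# Assumed toy data: rows are examples, columns are binary features.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])

def fit_bernoulli_nb(X, y, alpha=1.0):
    classes = np.unique(y)
    log_prior = {k: np.log(np.mean(y == k)) for k in classes}
    # theta[k][j] = estimate of P(x_j = 1 | y = k), Laplace-smoothed
    theta = {k: (X[y == k].sum(axis=0) + alpha) / (np.sum(y == k) + 2 * alpha)
             for k in classes}

    def predict(Xnew):
        # log rho(y) + sum_j log rho(x_j | y), by the independence assumption
        scores = [log_prior[k] + Xnew @ np.log(theta[k])
                  + (1 - Xnew) @ np.log(1 - theta[k]) for k in classes]
        return classes[np.argmax(scores, axis=0)]
    return predict

nb = fit_bernoulli_nb(X, y)
print(nb(np.array([[1, 1, 0], [0, 0, 1]])))  # -> [0 1]
```

The factorisation reduces the estimation problem from one joint density over all features to one 1-D distribution per feature per class.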
What loss corresponds to maximum likelihood estimation?
The log-loss l(x,y,θ)=−logρθ(x,y).
What is empirical risk for maximum likelihood?
R̂(θ) = Σⱼ −log ρθ(xⱼ, yⱼ); its minimiser is exactly the maximum-likelihood estimate.
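The MLE-as-ERM view can be checked on a one-parameter example. Assuming a N(θ, 1) model for toy data, the empirical risk under log-loss is a sum of negative log densities, and its minimiser is the sample mean (the MLE):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(3.0, 1.0, size=1000)  # assumed toy sample, true mean 3

def empirical_risk(theta, x):
    """Sum of log-losses -log rho_theta(x_j) for the N(theta, 1) model."""
    return np.sum(0.5 * (x - theta) ** 2 + 0.5 * np.log(2 * np.pi))

# Minimise the empirical risk over a grid; this recovers the sample mean.
grid = np.linspace(0.0, 6.0, 601)
theta_hat = grid[np.argmin([empirical_risk(t, data) for t in grid])]
print(theta_hat, data.mean())
```

The grid minimiser agrees with the sample mean up to the grid resolution, illustrating that minimising log-loss empirical risk and maximising likelihood are the same optimisation.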
Why are generative models hard in practice?
It’s difficult to specify a realistic density model ρ(x,y) for complex data.
What is the purpose of regularisation?
To penalise model complexity and prevent overfitting by modifying the objective.
What is explicit regularisation?
Adding λΩ(f) to empirical risk: J(f)=R̂(f)+λΩ(f).
What does λ (lambda) control in regularisation?
The strength of the complexity penalty; it is a hyperparameter chosen with cross-validation.
What is L2 regularisation?
Ω(θ)=‖θ‖₂² = Σ_j θⱼ² (ridge).
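Ridge regression makes the regularised objective J concrete: minimising ‖Xw − y‖² + λ‖w‖² has the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy. A sketch on assumed toy data, showing how λ shrinks the coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
# Assumed toy regression problem: y = X w* + noise.
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

def ridge(X, y, lam):
    """Minimise ||Xw - y||^2 + lam * ||w||^2 via the normal equations."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, lam=0.01)    # weak penalty: close to least squares
w_large = ridge(X, y, lam=1000.0)  # strong penalty: coefficients shrink toward 0
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Sweeping λ and scoring each fit on held-out data is the cross-validation procedure mentioned above for choosing the hyperparameter.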