What is the goal of supervised learning?
Learn a predictor f: X → Y that maps features to labels using labelled data.
What is feature selection?
Choosing measurable quantities (features x¹,…,xᵈ) that help predict the output label.
What is a feature vector?
A vector x ∈ ℝᵈ representing the selected measurable attributes of an input.
What is a label in supervised learning?
A quantitative or categorical value y representing the target to be predicted.
What is a labelled dataset?
A set of pairs (x₁,y₁),…,(xₙ,yₙ) where each feature vector has a corresponding label.
What is a learning algorithm?
A map taking a dataset in (X × Y)ⁿ and outputting a predictor f: X → Y.
What makes a predictor good?
It should approximate the ‘true’ relationship between features and labels, though an exact true label function rarely exists.
Why does the ‘true’ predictor rarely exist?
Because labelers may disagree or the features do not capture all necessary information (plant location example).
What is the minimum error predictor?
A predictor that assigns labels to minimize misclassification error over the entire population.
Why can’t we compute the true minimum-error predictor directly?
It requires knowing the entire population distribution of (X,Y).
What assumption is made in statistical learning to avoid enumerating the whole population?
We assume the population is described by a probability distribution ρ ∈ Prob(X × Y).
What does sampling from the population correspond to?
Drawing samples from ρ, the assumed distribution over (X,Y).
What is the Bayes optimal predictor f*?
The predictor that minimizes expected loss: f*(x) ∈ argmin_z ∫ l(y,z) dρ(y|x).
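The definition above can be made concrete on a toy discrete problem. The sketch below (with made-up conditional probabilities, purely for illustration) computes f*(x) by minimizing the conditional expected loss over candidate labels z:

```python
import numpy as np

# Hypothetical toy problem: X in {0, 1}, Y in {0, 1}.
# Conditional probabilities P(Y=1 | X=x), chosen arbitrarily for illustration.
p_y1_given_x = {0: 0.2, 1: 0.7}

def bayes_predictor(x, loss):
    """Return argmin_z E[loss(Y, z) | X = x] for binary Y and z in {0, 1}."""
    p1 = p_y1_given_x[x]
    # Conditional expected loss of predicting z, for each candidate z.
    expected = [(1 - p1) * loss(0, z) + p1 * loss(1, z) for z in (0, 1)]
    return int(np.argmin(expected))

zero_one = lambda y, z: float(y != z)  # 0-1 loss

print(bayes_predictor(0, zero_one))  # 0, since P(Y=1|X=0) = 0.2 < 0.5
print(bayes_predictor(1, zero_one))  # 1, since P(Y=1|X=1) = 0.7 > 0.5
```

The integral in the definition collapses to a finite sum because Y is discrete here; the same argmin structure holds in general.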
What does indicator notation 1[⋯] mean?
It equals 1 if the condition is true, otherwise 0.
What does the Bayes classifier do in binary classification?
Predicts 1 if P(Y=1|X=x) exceeds a threshold determined by the loss function (1/2 under 0–1 loss); otherwise predicts 0.
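How the threshold depends on the loss can be shown directly: with (assumed, illustrative) costs `cost_fp` for a false positive and `cost_fn` for a false negative, predicting 1 has lower expected loss exactly when p₁ exceeds cost_fp / (cost_fp + cost_fn):

```python
def bayes_classifier(p1, cost_fp=1.0, cost_fn=1.0):
    """Threshold rule derived from comparing expected losses:
    predict 1 iff (1 - p1) * cost_fp < p1 * cost_fn,
    i.e. iff p1 > cost_fp / (cost_fp + cost_fn)."""
    threshold = cost_fp / (cost_fp + cost_fn)
    return 1 if p1 > threshold else 0

print(bayes_classifier(0.7))                        # 1 (symmetric loss: threshold 0.5)
print(bayes_classifier(0.3, cost_fp=1, cost_fn=4))  # 1 (costly misses lower the threshold to 0.2)
```

With symmetric 0–1 loss the threshold is 1/2, recovering the standard Bayes classifier.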
Why is Bayes risk non-zero?
Because the conditional distributions overlap; even the optimal classifier misclassifies some points.
What is the disadvantage of k-NN compared to the Bayes classifier?
It is slow at prediction time, must store the entire training set, and can overfit (especially for small k).
What is the goal of decision theory in ML?
To compute f* that minimizes expected risk under the true distribution ρ.
What is risk R[f]?
The expected loss: ∫ l(y,f(x)) dρ(x,y).
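Since ρ is unknown, the integral is typically approximated by an average over samples. A minimal sketch of this empirical risk estimate (function names here are illustrative, not from the source):

```python
def zero_one_loss(y, z):
    return float(y != z)

def estimate_risk(f, sample, loss=zero_one_loss):
    """Empirical risk: average loss of f over pairs (x, y) sampled from rho."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

# Toy usage: a sign-based predictor on a small hand-made sample.
f = lambda x: 1 if x > 0 else 0
sample = [(1, 1), (2, 1), (-1, 0), (-2, 1)]
print(estimate_risk(f, sample))  # 0.25 (one mistake in four)
```

By the law of large numbers this average converges to R[f] as the sample grows.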
What alternative objective involving TPR and FPR can be used?
Maximize TPR subject to constraints on FPR (useful in imbalanced classification).
What is a False Positive Rate (FPR)?
P(f(X)=1 | Y=0).
What is a True Positive Rate (TPR)?
P(f(X)=1 | Y=1).
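Both rates are conditional probabilities, so their empirical versions are counts restricted to one class. A small sketch (the helper name is my own) computing them from labels and predictions:

```python
def tpr_fpr(y_true, y_pred):
    """Empirical TPR = TP / (# positives), FPR = FP / (# negatives)."""
    tp = sum(1 for y, z in zip(y_true, y_pred) if y == 1 and z == 1)
    fp = sum(1 for y, z in zip(y_true, y_pred) if y == 0 and z == 1)
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    return tp / pos, fp / neg

print(tpr_fpr([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5)
```

Note the denominators differ: TPR conditions on Y=1, FPR on Y=0, which is why the pair stays informative even when classes are imbalanced.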
Why do we use probabilistic modelling in supervised learning?
Because we cannot observe the entire population, so we model (X,Y) with a probability distribution.
How does the Abalone example illustrate minimum error prediction?
If the number of rings satisfies x ≤ 8, most abalones in the sample are male, so predicting f(x) = male minimizes empirical error.
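The minimum empirical error rule amounts to predicting the majority label among training points sharing the same feature value. A sketch with a made-up mini-dataset in the spirit of the Abalone example (not the real data):

```python
from collections import Counter, defaultdict

def majority_predictor(pairs):
    """Fit f(x) = most common label among training points with feature value x."""
    by_x = defaultdict(Counter)
    for x, y in pairs:
        by_x[x][y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in by_x.items()}

# Hypothetical data: x = (rings <= 8), y = sex.
data = [(True, "male"), (True, "male"), (True, "female"),
        (False, "female"), (False, "female"), (False, "male")]
f = majority_predictor(data)
print(f[True])   # male
print(f[False])  # female
```

Any other choice of label for a given x would disagree with more training points, hence increase empirical error.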