What is the general form of a statistical learning model?
Y = f(X) + ε, where f is the unknown function capturing the relationship between predictors X and response Y, and ε is a random error term independent of X with mean zero.
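This model can be simulated directly. A minimal sketch, assuming an illustrative true function f(x) = 2x + 1 and Gaussian noise (all specifics here are made up for the example):

```python
import numpy as np

# Simulate data from Y = f(X) + eps. The true f is "unknown" in practice;
# here we choose f(x) = 2x + 1 (an assumption) so we can generate data.
rng = np.random.default_rng(0)

def f(x):
    return 2 * x + 1  # assumed true regression function

n = 1000
X = rng.uniform(0, 1, size=n)
eps = rng.normal(loc=0.0, scale=0.1, size=n)  # mean-zero error, independent of X
Y = f(X) + eps
```

Because ε has mean zero, the simulated Y values are centered on f(X); a learning method only ever sees the (X, Y) pairs, never f or ε.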
What is the difference between a predictor and a response variable?
Predictors (X1, X2, … Xp) are input or independent variables used to explain or predict outcomes. The response variable (Y) is the output or dependent variable being predicted.
What are the two main reasons to estimate f?
Prediction and inference. Prediction uses f̂ to estimate Y when X is observed but Y is not. Inference aims to understand how Y is associated with each predictor.
What are reducible vs irreducible errors?
Reducible error comes from the inaccuracy of the estimate f̂ and can be reduced by choosing a better model. Irreducible error is the variance of the error term ε and cannot be eliminated, no matter how well f is estimated.
What is training data?
Training data is the dataset {(x1,y1),…,(xn,yn)} used to train a model so it can learn the relationship between predictors and response.
What is the difference between parametric and non-parametric methods?
Parametric methods assume a specific functional form for f and estimate parameters. Non-parametric methods do not assume a specific shape for f and instead learn it directly from data.
What is the Mean Squared Error (MSE) formula and what does it measure?
MSE = (1/n) Σ(yi − f̂(xi))². It measures the average squared difference between predicted and actual values.
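The formula translates directly to code. A minimal sketch with made-up values:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of (y_i - f_hat(x_i))^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Toy example: residuals are 1, -1, 0, so MSE = (1 + 1 + 0) / 3
result = mse([3.0, 5.0, 2.0], [2.0, 6.0, 2.0])
```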
Write the Bias-Variance decomposition of expected test MSE.
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε), where the expectation averages over many training sets. The three terms are the model's variance, its squared bias, and the irreducible error.
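The decomposition can be checked by Monte Carlo. A sketch under assumed settings: true f(x) = x², noise sd 0.5, and a deliberately biased linear fit, all chosen for illustration:

```python
import numpy as np

# Repeatedly draw training sets, fit a straight line to quadratic data,
# and record the prediction f-hat(x0) from each fit.
rng = np.random.default_rng(1)
true_f = lambda x: x ** 2      # assumed true function
sigma = 0.5                    # assumed noise standard deviation
x0, n, reps = 0.9, 50, 2000

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(-1, 1, n)
    y = true_f(x) + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)   # degree-1 (linear) fit
    preds[r] = slope * x0 + intercept        # f-hat(x0) for this training set

var_fhat = float(preds.var())                        # Var(f-hat(x0))
bias_sq = float((preds.mean() - true_f(x0)) ** 2)    # Bias(f-hat(x0))^2
decomposed = var_fhat + bias_sq + sigma ** 2         # right-hand side

# Direct estimate of E[(y0 - f-hat(x0))^2] for comparison
y0 = true_f(x0) + rng.normal(0, sigma, reps)
direct = float(np.mean((y0 - preds) ** 2))
```

The direct estimate and the sum of the three terms should agree up to simulation noise; here the squared bias dominates because a linear model cannot capture x².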
What is the KNN classification probability formula?
Pr(Y = j | X = x0) = (1/K) Σ_{i∈N₀} I(yi = j), where N₀ is the set of the K training points nearest x0. It estimates the class-j probability as the fraction of those K neighbors with label j.
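A minimal sketch of this estimate on toy 1-D data (the data and K = 3 are assumptions for illustration):

```python
import numpy as np

def knn_prob(x0, X, y, j, K):
    """Fraction of the K nearest neighbors of x0 with label j."""
    idx = np.argsort(np.abs(X - x0))[:K]   # indices of the K nearest points
    return float(np.mean(y[idx] == j))

X = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
y = np.array([0, 0, 1, 1, 1])

# Neighbors of 0.15 are 0.1, 0.2, 0.3 with labels 0, 0, 1
p = knn_prob(0.15, X, y, j=0, K=3)
```

The KNN classifier then assigns x0 to whichever class j maximizes this estimated probability.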
What is the Bayes error rate formula?
1 − E[max_j Pr(Y=j | X)], where the expectation is over X. It is the lowest expected classification error any classifier can achieve, attained by the Bayes classifier.
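When the conditional probabilities are known, the Bayes error rate can be computed exactly. A toy two-class sketch where X takes three values with assumed probabilities (all numbers invented for the example):

```python
import numpy as np

# Distribution of X over its three possible values (assumed)
p_x = np.array([0.5, 0.3, 0.2])
# Pr(Y = 1 | X = x) for each value of X (assumed)
p1_given_x = np.array([0.9, 0.5, 0.2])

# max_j Pr(Y = j | X = x): the Bayes classifier picks the larger class
p_max = np.maximum(p1_given_x, 1 - p1_given_x)

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)]
bayes_error = 1 - float(np.sum(p_x * p_max))
```

With these numbers p_max = [0.9, 0.5, 0.8], so the Bayes error rate is 1 − 0.76 = 0.24; no classifier can do better on this distribution.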
What are the assumptions of parametric methods?
Parametric methods assume a specific functional form for the relationship between predictors and response, such as linearity.
When does irreducible error exist?
Always. It arises from unmeasured variables, inherent randomness, or noise in the system that cannot be modeled.
What is bias and what is variance in a model?
Bias is error due to incorrect assumptions about the model form (underfitting). Variance is error due to sensitivity to training data fluctuations (overfitting).
What is the shape of the test MSE curve as model flexibility increases?
A U-shaped curve. Initially test MSE decreases as bias falls, then increases as variance dominates.
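The U shape can be reproduced by sweeping polynomial degree on simulated data. A sketch under assumed settings (true f(x) = sin(2πx), noise sd 0.3, 30 training points):

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed true function

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, true_f(x) + rng.normal(0, 0.3, n)

x_tr, y_tr = make_data(30)     # small training set
x_te, y_te = make_data(1000)   # large test set

train_mse, test_mse = [], []
for deg in range(1, 11):       # increasing flexibility
    coefs = np.polyfit(x_tr, y_tr, deg)
    train_mse.append(float(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)))
    test_mse.append(float(np.mean((y_te - np.polyval(coefs, x_te)) ** 2)))
```

Training MSE keeps falling as degree grows, while test MSE first falls (bias shrinks) and then rises again (variance dominates), tracing the U.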
What happens to bias and variance as K increases in KNN?
As K increases, bias increases and variance decreases. As K decreases, bias decreases and variance increases.
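A small sketch of this tradeoff with KNN regression on assumed 1-D data (true f(x) = sin(3x), noise sd 0.2): at K = n the prediction collapses to the global training mean, the maximally biased constant fit.

```python
import numpy as np

rng = np.random.default_rng(3)
true_f = lambda x: np.sin(3 * x)   # assumed true function
x_tr = rng.uniform(0, 3, 100)
y_tr = true_f(x_tr) + rng.normal(0, 0.2, 100)
x_te = rng.uniform(0, 3, 500)
y_te = true_f(x_te) + rng.normal(0, 0.2, 500)

def knn_predict(x0, K):
    """Average response of the K training points nearest x0."""
    idx = np.argsort(np.abs(x_tr - x0))[:K]
    return y_tr[idx].mean()

def knn_test_mse(K):
    preds = np.array([knn_predict(x, K) for x in x_te])
    return float(np.mean((y_te - preds) ** 2))
```

A moderate K (here around 5) should beat both extremes: K = 1 tracks the noise (high variance), while K = 100 ignores the local structure entirely (high bias).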
Prediction vs Inference — how do they differ?
Prediction focuses on accurate predictions and often uses flexible models. Inference focuses on understanding relationships between predictors and response.
Parametric vs Non-parametric methods — pros and cons?
Parametric methods are simpler and require less data but risk incorrect assumptions. Non-parametric methods are flexible but require more data and may overfit.
Supervised vs Unsupervised learning — what’s the difference?
Supervised learning uses labeled data with both X and Y. Unsupervised learning uses only X and seeks patterns or structure in data.
Regression vs Classification — when do you use each?
Regression is used when the response variable is continuous. Classification is used when the response variable is categorical.
Flexibility vs Interpretability in models?
As model flexibility increases, interpretability generally decreases. Simple linear models are interpretable; complex models like deep learning are less interpretable.
Bayes Classifier vs KNN — how are they related?
The Bayes classifier is the theoretical optimal classifier using true probabilities. KNN approximates these probabilities using nearby observations.
Why is minimizing training MSE not sufficient for model selection?
Because a model may overfit training data. The goal is to minimize test MSE, which reflects performance on unseen data.
When would you prefer a less flexible model?
When interpretability is important, when the true relationship is simple, when data is limited, or when avoiding overfitting.
What is the No Free Lunch theorem in statistics?
No single algorithm performs best for all datasets. Model performance depends on the specific data and problem.