What is bias-variance tradeoff?
Linear Regression vs. Deep NN, which has higher bias/variance?
Bias: Error due to wrong assumptions → underfitting.
Variance: Error due to sensitivity to data fluctuations → overfitting.
Linear Regression - high bias, low variance.
Deep NN - low bias, high variance.
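A minimal sketch of the tradeoff using synthetic data (numpy only; the degree-9 polynomial stands in here for a high-capacity model like a deep NN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a nonlinear target: y = sin(2πx) + noise
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)

# High-bias model: a straight line cannot capture the sine shape (underfits)
line = np.polyval(np.polyfit(x, y, deg=1), x)

# High-variance model: a degree-9 polynomial can chase the noise (overfits)
flex = np.polyval(np.polyfit(x, y, deg=9), x)

mse_line = np.mean((y - line) ** 2)
mse_flex = np.mean((y - flex) ** 2)
# The flexible model always achieves lower *training* error, but its fit
# would swing wildly on a fresh draw of the data — that swing is variance.
print(mse_line, mse_flex)
```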
Linear Regression equation, what optimisation method is used?
y ≈ β₀ + β₁x₁ + … + βₙxₙ
Ordinary Least Squares (OLS) — typically solved in closed form via the normal equations, or iteratively with gradient descent for large n.
OLS closed form answer, Is it convex/concave? Assumptions of OLS?
β̂ = (XᵀX)⁻¹ Xᵀy
Convex (not concave): the objective is quadratic with positive semi-definite Hessian XᵀX, so a global minimum is guaranteed.
Assumptions: linearity, independent errors, homoscedasticity (constant error variance), normally distributed errors, and no perfect multicollinearity (XᵀX invertible).
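A quick numpy check of the closed form on synthetic data (the data and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(0, 0.1, n)

# Closed-form OLS: β̂ = (XᵀX)⁻¹ Xᵀy
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```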
Evaluation of Linear Regression metrics, their formulas and standard error of β̂.
R² = 1 − (RSS/TSS) → fraction of variance explained by the model.
Other metrics: MSE, RMSE.
SE(β̂ⱼ) = √[ σ² (XᵀX)⁻¹ ]ⱼⱼ  (the j-th diagonal entry; it simplifies to √(σ²/n) only in the sample-mean special case)
Where:
σ² = variance of the errors (estimated as σ̂² = RSS / (n − p), with p = number of predictors including intercept).
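These metrics can be computed directly in numpy (synthetic data; a sketch, not a full regression diagnostic):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

rss = np.sum(resid ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss          # fraction of variance explained
mse = rss / n
rmse = np.sqrt(mse)

# σ̂² = RSS / (n − p), with p = number of estimated coefficients (incl. intercept)
p = X.shape[1]
sigma2_hat = rss / (n - p)
# Standard errors: square roots of the diagonal of σ̂² (XᵀX)⁻¹
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
print(r2, rmse, se)
```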
Properties of R² and Adjusted R² formula.
R² never decreases when you add variables (even useless ones), so we use:
Adjusted R²
= 1 − [(1 − R²) * (n − 1) / (n − p − 1)]
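A small demonstration of the property above, adding a pure-noise predictor (illustrative data; the helper `r2_scores` is my own):

```python
import numpy as np

def r2_scores(X, y):
    """Return (R², adjusted R²) for an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n, p = Xd.shape[0], X.shape[1]  # p = predictors excluding intercept
    r2 = 1 - rss / tss
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                 # pure-noise predictor
y = 3 * x1 + rng.normal(0, 1, n)

r2_small, adj_small = r2_scores(x1[:, None], y)
r2_big, adj_big = r2_scores(np.column_stack([x1, junk]), y)
# R² can only go up when a variable is added; adjusted R²
# applies the (n − 1)/(n − p − 1) penalty and can go down.
print(r2_big - r2_small, adj_big - adj_small)
```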
What does F statistic say and what is p value? When do we reject H₀?
It tests whether the overall regression model is significant, i.e., whether at least one predictor has a non-zero coefficient.
The p-value measures the probability of observing the data (or more extreme) assuming the null hypothesis H₀ is true.
Conventionally, we reject H₀ when p < 0.05.
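The F statistic can be computed from RSS and TSS; a sketch on synthetic data, assuming scipy is available for the F-distribution tail probability:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
n, p = 120, 2                      # p predictors, plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, 0.0]) + rng.normal(0, 1, n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F = [(TSS − RSS)/p] / [RSS/(n − p − 1)] tests H₀: all slopes are zero
F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = f_dist.sf(F, p, n - p - 1)   # upper-tail probability under H₀
print(F, p_value)
# Here one slope is truly non-zero, so p_value comes out well below 0.05.
```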
Methods of Feature selection in Multiple Linear Regression?
1) Forward selection: start with the null model and add variables one at a time based on p-values or R².
2) Backward selection: start with all predictors and remove one at a time toward a simpler/better model.
3) Mixed (stepwise) selection: start with the null model, then both add and remove variables using p-values or adjusted R².
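Forward selection by adjusted R² can be sketched in a few lines (synthetic data with two truly relevant columns; the greedy loop and `adj_r2` helper are my own minimal version, not a library routine):

```python
import numpy as np

def adj_r2(cols, X, y):
    """Adjusted R² of an OLS fit on the chosen columns (with intercept)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    return 1 - (1 - r2) * (n - 1) / (n - len(cols) - 1)

rng = np.random.default_rng(5)
n = 150
X = rng.normal(size=(n, 5))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 1, n)  # only cols 0 and 3 matter

# Forward selection: greedily add whichever variable most improves adjusted R²
selected, remaining = [], list(range(X.shape[1]))
while remaining:
    current = adj_r2(selected, X, y)
    scores = {j: adj_r2(selected + [j], X, y) for j in remaining}
    best = max(scores, key=scores.get)
    if scores[best] <= current:
        break                      # no candidate improves the model; stop
    selected.append(best)
    remaining.remove(best)

print(sorted(selected))
```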
What is the OLS objective function for baseline linear regression, ridge, lasso and elastic net.
Regular: min_w ||y − Xw||²
Ridge:
min_w ||y − Xw||² + λ||w||²₂
Lasso: min_w ||y − Xw||² + λ||w||₁
Elastic Net:
min_w ||y − Xw||² + λ₁||w||₁ + λ₂||w||²₂
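The ridge objective has a closed-form minimizer, β̂ = (XᵀX + λI)⁻¹ Xᵀy, which makes the effect of λ easy to see (illustrative data; intercept omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, n)

def ridge(X, y, lam):
    """Ridge closed form: β̂ = (XᵀX + λI)⁻¹ Xᵀy."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)     # λ = 0 recovers plain OLS
b_ridge = ridge(X, y, 10.0)  # larger λ → stronger shrinkage of ‖β̂‖
print(np.linalg.norm(b_ols), np.linalg.norm(b_ridge))
```

(Lasso and elastic net have no closed form because of the ‖w‖₁ term; they are solved iteratively, e.g. by coordinate descent.)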
What is Ridge Regression?
How does ridge regression shrink coefficients?
Does ridge regression perform feature selection?
Linear regression with an L2 penalty on coefficients.
Continuously shrinks them toward zero but never exactly zero.
No.
When is Ridge preferred?
Effect of λ in ridge regression?
What happens when λ = 0?
How does ridge handle multicollinearity?
When many correlated predictors exist.
Larger λ → stronger shrinkage.
Ridge reduces to OLS.
Distributes coefficient weights across correlated features.
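An extreme case of this: with two identical copies of a feature, OLS breaks down (XᵀX is singular) while ridge splits the weight evenly (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z, z])          # two perfectly correlated copies
y = 2 * z + rng.normal(0, 0.1, n)

lam = 1.0
# OLS would fail here: XᵀX is singular. The λI term makes it invertible.
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta)   # by symmetry, the weight is split evenly across the copies
```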
What is Lasso? Key property of lasso? Does Lasso perform feature selection?
Linear regression with an L1 penalty on coefficients.
Produces sparse models.
Yes.
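The exact zeros come from the soft-thresholding step in coordinate descent; a minimal sketch (my own toy solver for ½‖y − Xw‖² + λ‖w‖₁ — note the ½ on the loss term, so λ is scaled accordingly; not a production lasso):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator — the source of lasso's exact zeros."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimal lasso via coordinate descent on ½‖y − Xw‖² + λ‖w‖₁."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's contribution removed
            r = y - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w

rng = np.random.default_rng(8)
n, p = 100, 6
X = rng.normal(size=(n, p))
# Only the first two features carry signal
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0]) + rng.normal(0, 0.5, n)

w = lasso_cd(X, y, lam=50.0)
print(w)  # irrelevant coefficients are driven to exactly 0.0
```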
How does lasso behave with correlated predictors?
When is lasso preferred?
Tends to arbitrarily select one and zero out the rest.
When only a few predictors are truly relevant.
What is Elastic Net? Why was elastic net introduced? How does elastic net handle correlated predictors?
A combination of L1 and L2 regularization.
To combine sparsity of lasso with stability of ridge.
Tends to select groups of correlated features together.
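The grouping effect can be seen by adding an L2 term to the lasso coordinate-descent update (my own toy solver for ½‖y − Xw‖² + λ₁‖w‖₁ + ½λ₂‖w‖²₂, with the ½-scaled convention; the data and λ values are illustrative):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, n_iter=500):
    """Minimal elastic net via coordinate descent on
    ½‖y − Xw‖² + λ₁‖w‖₁ + ½λ₂‖w‖²₂."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]   # partial residual excluding j
            rho = X[:, j] @ r
            # L1 soft threshold in the numerator, L2 shrinkage in the denominator
            w[j] = np.sign(rho) * max(abs(rho) - lam1, 0.0) / (col_sq[j] + lam2)
    return w

rng = np.random.default_rng(9)
n = 200
z = rng.normal(size=n)
x1 = z + rng.normal(0, 0.01, n)   # near-duplicate features
x2 = z + rng.normal(0, 0.01, n)
x3 = rng.normal(size=n)           # independent feature
X = np.column_stack([x1, x2, x3])
y = 2 * z + rng.normal(0, 0.1, n)

w = enet_cd(X, y, lam1=5.0, lam2=50.0)
print(w)  # the L2 term spreads weight across the correlated pair
```

Pure lasso on the same data would tend to put all the weight on just one of x1/x2; the λ₂ term is what keeps the group together.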