What is the geometric interpretation of linear regression?
Fitting a hyperplane that minimizes the sum of squared vertical distances (residuals) between observed targets and predictions; mathematically, the fitted values are the orthogonal projection of y onto the column space of X.
Derive the closed-form OLS solution.
Minimize: ‖y − Xβ‖²
Set the derivative to zero (normal equations): XᵀXβ = Xᵀy
Solution (if XᵀX is invertible): β̂ = (XᵀX)⁻¹Xᵀy
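As a sanity check, the closed form can be sketched in a few lines of NumPy; the synthetic data and coefficient values are illustrative:

```python
import numpy as np

# Illustrative data: intercept column plus two features
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

# Normal equations: solve XᵀX β = Xᵀ y (solve is preferred over an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```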
Why is OLS unbiased?
Because under the exogeneity assumption
E[ε | X] = 0,
it follows that E[β̂] = β,
i.e., the estimator's expectation equals the true population parameter.
When does the OLS solution not exist?
When XᵀX is non-invertible (singular or near-singular), typically due to perfect multicollinearity.
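A minimal NumPy sketch of this failure mode, using an exactly duplicated (scaled) column; the data are illustrative:

```python
import numpy as np

# Perfect multicollinearity: the second column is an exact multiple of the
# first, so XᵀX is rank-deficient and has no inverse (illustrative data).
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2.0 * x1])
y = x1 + rng.normal(scale=0.1, size=50)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 1, not 2: singular

# np.linalg.pinv still returns the minimum-norm least-squares solution
beta_pinv = np.linalg.pinv(X) @ y
```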
What metrics are commonly used to evaluate linear regression?
RMSE
MAE
R² and Adjusted R²
Cross-validated RMSE/MAE
Prediction intervals (when uncertainty matters)
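These metrics are all one-liners; a sketch with toy values (p = 1 predictor assumed for Adjusted R²):

```python
import numpy as np

# Toy values; each prediction is off by 0.5
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])
n, p = len(y_true), 1  # n observations, p predictors

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mae = np.mean(np.abs(y_true - y_pred))
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(rmse, mae, r2, adj_r2)  # 0.5 0.5 0.95 0.925
```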
What does a negative R^2 indicate?
The model performs worse than a horizontal line at the mean of the target variable, i.e., worse than simply predicting the average of the data.
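A tiny worked example, with deliberately anti-correlated predictions (toy numbers):

```python
import numpy as np

# R² = 1 - SS_res/SS_tot goes negative when predictions are worse than
# predicting the target mean.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_bad = np.array([4.0, 3.0, 2.0, 1.0])  # anti-correlated predictions

ss_res = np.sum((y - y_pred_bad) ** 2)       # 20
ss_tot = np.sum((y - y.mean()) ** 2)         # 5
r2 = 1 - ss_res / ss_tot
print(r2)  # -3.0: far worse than the mean baseline
```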
Why is Adjusted R^2 preferred for model comparison?
Unlike R^2, it penalizes adding irrelevant predictors, preventing inflated model performance.
Where does linear regression appear in agentic AI evaluation?
Reward model fitting (e.g., mapping features → scalar reward)
Calibration of LLM confidence scores
Post-hoc explainability (e.g., linear surrogate models for SHAP/LIME)
Retrieval scoring baselines in RAG (TF-IDF or BM25 often mapped linearly to relevance).
What are the GaussβMarkov assumptions?
Linearity
No perfect multicollinearity
Exogeneity (errors uncorrelated with predictors)
Homoscedasticity
No autocorrelation
Under these, OLS is BLUE (Best Linear Unbiased Estimator).
Is normality an assumption for unbiased coefficients?
No. Normality is needed only for valid hypothesis tests and confidence intervals, not for estimating coefficients.
What steps do you take if residuals show heteroscedasticity?
Transform target (log, BoxβCox)
Use Weighted Least Squares
Use robust standard errors (HuberβWhite)
Switch to models tolerant to heteroscedasticity
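One of these remedies, Weighted Least Squares, can be sketched directly from its normal equations; the inverse-variance weights below assume the noise scale is known, and the data are illustrative:

```python
import numpy as np

# Heteroscedastic data: error std grows linearly with x
rng = np.random.default_rng(2)
n = 200
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 * x                      # assumed-known noise scale
y = 2.0 + 3.0 * x + sigma * rng.normal(size=n)

# Inverse-variance weights down-weight the noisy high-x observations
w = 1.0 / sigma**2
W = np.diag(w)

# WLS normal equations: (Xᵀ W X) β = Xᵀ W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_wls)  # close to [2, 3]
```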
How do you detect multicollinearity?
VIF (Variance Inflation Factor)
Condition number
Correlation matrix
Singular values of XᵀX
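A from-scratch sketch of the VIF diagnostic: VIF_j = 1/(1 − R_j²), where R_j² comes from regressing feature j on the other features. The data and the "VIF > 10" rule of thumb are illustrative:

```python
import numpy as np

# Illustrative design: x2 is nearly collinear with x1, x3 is independent
rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the remaining columns (with intercept)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    resid = X[:, j] - A @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 heavily inflated; x3 near 1
```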
Why does multicollinearity not harm predictions but harms inference?
Coefficients become unstable and highly sensitive to noise, but combined predictions may still be good if features span the same subspace.
How does omitted variable bias occur?
If an omitted variable is correlated with both the included predictor and the target, coefficients absorb the correlation and become biased.
Compare forward selection, backward selection, and stepwise selection.
Forward: start empty → add predictors that improve the model
Backward: start full → remove the least useful predictors
Stepwise: mix of both, with bidirectional testing
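A minimal sketch of forward selection scored by training R²; in practice AIC/BIC or cross-validation is preferable, and the 0.01 improvement cutoff is an illustrative choice:

```python
import numpy as np

# Illustrative data: only features 0 and 2 actually matter
rng = np.random.default_rng(4)
n, p = 300, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

def r2_of(cols):
    """Training R² of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

selected, remaining = [], list(range(p))
best = r2_of([])
while remaining:
    # Greedily add the candidate that most improves R²
    scores = {c: r2_of(selected + [c]) for c in remaining}
    c_best = max(scores, key=scores.get)
    if scores[c_best] - best < 0.01:   # stop when improvement is negligible
        break
    selected.append(c_best)
    remaining.remove(c_best)
    best = scores[c_best]

print(selected)  # picks the informative features first, skips the noise ones
```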
Why is subset selection often unstable?
Small data perturbations may change which variables are selected; sensitive to correlated features.
Why are LASSO and Ridge more reliable than subset selection?
Because they produce more stable solutions through continuous shrinkage and avoid combinatorial search.
What is leverage vs influence?
Leverage: unusual predictor values (x-space)
Influence: actual impact on fitted regression (e.g., via Cookβs distance).
High leverage ≠ influential unless it also changes predictions.
How do you handle outliers?
Robust regression (Huber, RANSAC)
Winsorizing
Transformations
Check data errors
Use median-based metrics (MAE)
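Winsorizing, one of the remedies above, is a one-liner with NumPy; the 5th/95th percentile cutoffs here are an illustrative choice:

```python
import numpy as np

# Illustrative data with a few gross outliers injected
rng = np.random.default_rng(5)
y = rng.normal(loc=10.0, scale=1.0, size=1000)
y[:5] = 1000.0

# Clip both tails to the chosen percentiles
lo, hi = np.percentile(y, [5, 95])
y_wins = np.clip(y, lo, hi)

print(y.mean())       # badly pulled up by the outliers
print(y_wins.mean())  # close to the true mean of 10
```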
What is the consequence of heteroscedasticity?
OLS remains unbiased but no longer has minimum variance → standard errors become incorrect → hypothesis tests invalid.
How do you test for heteroscedasticity?
BreuschβPagan
White test
Residual plots (funnel shape)
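The Breusch–Pagan test can be sketched from scratch: regress the squared OLS residuals on the predictors and form LM = n·R², which is approximately χ² (degrees of freedom = number of non-intercept regressors) under homoscedasticity. The data below are illustrative:

```python
import numpy as np

# Illustrative heteroscedastic data: noise scale grows with x
rng = np.random.default_rng(6)
n = 500
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + x * rng.normal(size=n)

# OLS fit and residuals
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Auxiliary regression: squared residuals on the same predictors
g = np.linalg.lstsq(X, resid**2, rcond=None)[0]
aux_resid = resid**2 - X @ g
r2_aux = 1 - aux_resid.var() / (resid**2).var()

lm = n * r2_aux                # LM statistic, ~chi2(1) here
print(lm > 3.84)               # True: exceeds the 5% chi2(1) critical value
```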
Why doesnβt OLS require normally distributed predictors or residuals for unbiasedness?
Bias depends on expectation of errors conditional on X; normality only affects inference/testing, not point estimation.
How do you check residual normality?
QQ-plots
ShapiroβWilk test
KolmogorovβSmirnov test
Skewness/kurtosis indicators
How does regularization help with multicollinearity?
Ridge shrinks coefficient magnitudes, stabilizing them
LASSO performs feature selection
Both reduce variance inflation.
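The stabilizing effect can be sketched with ridge's closed form β̂ = (XᵀX + λI)⁻¹Xᵀy on a nearly collinear design; λ = 1 is an illustrative choice:

```python
import numpy as np

# Nearly collinear design: the plain normal equations would be ill-conditioned
rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n)])
y = x1 + 0.1 * rng.normal(size=n)

# Ridge closed form: adding λI makes the system well-posed
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ridge)  # the unit coefficient splits stably across the two columns (~0.5 each)
```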