Define the least squares estimator and state the Gauss-Markov theorem
βˆ = (X^T X)^{-1} X^T Y
The Gauss-Markov theorem states that βˆ is the best linear unbiased estimator (BLUE): it has the smallest variance among all linear unbiased estimators. Biased estimators (e.g. ridge) can nevertheless achieve a smaller MSE.
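A quick R check that the closed form matches lm() (simulated data; all names illustrative):
  set.seed(1)
  n <- 100; p <- 3
  X <- matrix(rnorm(n * p), n, p)
  Y <- drop(X %*% c(1, -2, 0.5)) + rnorm(n)
  # closed-form least squares: solve the normal equations X^T X b = X^T Y
  beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
  coef(lm(Y ~ X - 1))   # same values (no intercept, to match the formula)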
What is the purpose of Ridge regression?
To trade a small increase in bias for a reduction in variance by penalising the squared size of the coefficients; this stabilises the estimate when X^T X is ill-conditioned (correlated predictors) and can yield a smaller MSE than least squares.
State βˆridge and its expectation
βˆridge = (X^T X + λI_p)^{-1} X^T Y
with E[βˆridge] = (X^T X + λI_p)^{-1} X^T E[Y] = (X^T X + λI_p)^{-1} X^T Xβ ≠ β for λ > 0, so the ridge estimator is biased.
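A minimal sketch of the closed form in R (simulated data; lambda chosen arbitrarily):
  set.seed(1)
  n <- 100; p <- 3
  X <- matrix(rnorm(n * p), n, p)
  Y <- drop(X %*% c(1, -2, 0.5)) + rnorm(n)
  lambda <- 5
  # ridge estimate: (X^T X + lambda I_p)^{-1} X^T Y
  beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)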
What are the differences between Lasso and Ridge?
Both penalise the size of the coefficients, but Lasso uses an L1 penalty (λ sum_j |β_j|) while Ridge uses an L2 penalty (λ sum_j β_j²). The L1 penalty sets some coefficients exactly to zero (variable selection); Ridge only shrinks them towards zero.
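A sketch assuming the glmnet package (in glmnet's convention, alpha = 1 is the lasso and alpha = 0 is ridge):
  library(glmnet)   # assumes glmnet is installed
  set.seed(1)
  X <- matrix(rnorm(100 * 10), 100, 10)
  Y <- drop(X[, 1:3] %*% c(3, -2, 1)) + rnorm(100)
  coef(glmnet(X, Y, alpha = 1), s = 0.5)   # lasso: several coefficients exactly zero
  coef(glmnet(X, Y, alpha = 0), s = 0.5)   # ridge: all shrunk, none exactly zero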
State the residual sum of squares of a multivariate linear model
RSS(β) = e^T e = (Y − Xβ)^T (Y − Xβ), where e = Y − Xβ
with ∂RSS(β)/∂β = −2X^T(Y − Xβ) and ∂²RSS(β)/∂β∂β^T = 2X^T X; setting the gradient to zero gives the normal equations X^T Xβ = X^T Y.
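A quick finite-difference check of the gradient formula in R (simulated data):
  set.seed(1)
  X <- matrix(rnorm(50 * 2), 50, 2)
  Y <- drop(X %*% c(1, 2)) + rnorm(50)
  rss <- function(b) sum((Y - X %*% b)^2)
  b0 <- c(0.5, 0.5)
  g <- -2 * t(X) %*% (Y - X %*% b0)   # analytic gradient
  eps <- 1e-6
  # central differences agree with g to ~1e-6
  sapply(1:2, function(j) { e <- rep(0, 2); e[j] <- eps
                            (rss(b0 + e) - rss(b0 - e)) / (2 * eps) })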
What is the z-score?
z_j = βˆ_j / (σˆ sqrt(v_j))
where v_j is the jth diagonal entry of (X^T X)^{-1}
σˆ² = (1/(n−p)) sum_{i=1}^n (Y_i − Yˆ_i)²
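A sketch computing the z-scores by hand in R and checking against summary.lm (simulated data; here p = 2 fitted parameters, no intercept):
  set.seed(1)
  n <- 100
  X <- matrix(rnorm(n * 2), n, 2)
  Y <- drop(X %*% c(1, 0)) + rnorm(n)
  fit <- lm(Y ~ X - 1)
  sigma2_hat <- sum(residuals(fit)^2) / (n - 2)
  v <- diag(solve(t(X) %*% X))              # v_j = jth diagonal of (X^T X)^{-1}
  coef(fit) / sqrt(sigma2_hat * v)          # z-scores
  summary(fit)$coefficients[, "t value"]    # same values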
What is the p-value and its interpretation?
Probability of observing a value more extreme than z_j: p_{z_j} = 2(1 − Φ(|z_j|))
where Φ is the CDF of the standard normal distribution N(0, 1)
Under H_0 : β_j = 0, z_j ~ t_{n−p} (≈ N(0, 1) for large n)
p < 0.01 is strong evidence to reject H_0; p < 0.05 is some evidence to reject H_0
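In R (illustrative numbers):
  n <- 100; p <- 2
  z <- 2.4
  2 * (1 - pt(abs(z), df = n - p))   # exact two-sided p-value under H_0
  2 * (1 - pnorm(abs(z)))            # normal approximation for large n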
What is the F-statistic and its interpretation?
A way to compare a model M1 with p1+1 parameters to a smaller model M0 with p0+1 parameters, where M0 is nested in M1.
F = [ (RSS0 - RSS1) / (p1 - p0) ] / [ RSS1 / (n-p1-1) ]
q99 <- qf(0.99, df1 = p1 - p0, df2 = n - p1 - 1)
- Reject H_0 at the 1% level if F > q99, the 99% quantile of the F_{p1−p0, n−p1−1} distribution
(one can start with M0 containing no variables and add variables one by one, comparing against the larger model M1; see the sketch below)
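A minimal R sketch of the nested-model F-test, checked against anova() (simulated data):
  set.seed(1)
  n <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  y <- 1 + 2 * x1 + rnorm(n)
  M0 <- lm(y ~ x1)                 # p0 = 1 predictor plus intercept
  M1 <- lm(y ~ x1 + x2 + x3)       # p1 = 3 predictors plus intercept
  rss0 <- sum(residuals(M0)^2); rss1 <- sum(residuals(M1)^2)
  Fstat <- ((rss0 - rss1) / (3 - 1)) / (rss1 / (n - 3 - 1))
  Fstat > qf(0.99, df1 = 3 - 1, df2 = n - 3 - 1)   # reject H_0 at the 1% level?
  anova(M0, M1)                                    # reports the same F statistic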
How is ridge regression a shrinkage method?
Using the SVD X = UDV^T (U, V orthogonal):
Xβˆridge = UD(D² + λI_p)^{-1} DU^T Y = sum_{j=1}^p u_j [d_j² / (d_j² + λ)] u_j^T Y
which shrinks most the coordinates with small d_j² (d_j is the jth diagonal element of D), i.e. the directions of small variance in X, where the least squares fit is least stable.
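A sketch in R of the shrinkage factors d_j²/(d_j² + λ), checked against the direct ridge fit (simulated data):
  set.seed(1)
  X <- scale(matrix(rnorm(100 * 4), 100, 4), center = TRUE, scale = FALSE)
  Y <- rnorm(100)
  s <- svd(X); lambda <- 10
  s$d^2 / (s$d^2 + lambda)   # one shrinkage factor per direction; smallest d_j^2 shrunk most
  fit_svd <- s$u %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$u) %*% Y
  fit_ridge <- X %*% solve(t(X) %*% X + lambda * diag(4), t(X) %*% Y)
  max(abs(fit_svd - fit_ridge))   # ~ 0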
What is the link between SVD and Principal Components?
The sample covariance matrix of centred data X is S = X^T X / n
Therefore the variance of X in the direction v_j is d_j² / n, where v_j is the jth eigenvector from the eigendecomposition of X^T X; z_j = Xv_j = u_j d_j is the jth principal component.
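A quick R check of the link; note prcomp divides by n − 1 where the card above uses n:
  set.seed(1)
  n <- 100
  X <- scale(matrix(rnorm(n * 3), n, 3), center = TRUE, scale = FALSE)
  svd(X)$d^2 / (n - 1)   # variances along the principal directions
  prcomp(X)$sdev^2       # same values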
What is the hat matrix and its degrees of freedom?
Hλ = X(X^T X + λI_p)^{-1} X^T, such that Yˆ = Hλ Y (fitted values)
and the effective degrees of freedom is df(λ) = tr{Hλ} = sum_{j=1}^p d_j² / (d_j² + λ)
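In R, df(λ) for a grid of λ (simulated X; df decreases from p towards 0 as λ grows):
  set.seed(1)
  X <- matrix(rnorm(100 * 4), 100, 4)
  d2 <- svd(X)$d^2
  df_ridge <- function(lambda) sum(d2 / (d2 + lambda))
  sapply(c(0, 1, 10, 100), df_ridge)   # df(0) = p = 4, then shrinks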
Recall vector and matrix differentiation formulas
- if y = x^T A x then ∂y/∂x = (A + A^T)x, which equals 2Ax when A is symmetric
- if y = a^T x then ∂y/∂x = a
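A quick finite-difference check in R of the quadratic-form gradient (illustrative values):
  A <- matrix(c(2, 1, 0, 3), 2, 2)        # deliberately not symmetric
  x <- c(1, -1)
  f <- function(x) drop(t(x) %*% A %*% x)
  (A + t(A)) %*% x                        # analytic gradient
  eps <- 1e-6
  # central differences reproduce the analytic gradient
  sapply(1:2, function(j) { e <- rep(0, 2); e[j] <- eps
                            (f(x + e) - f(x - e)) / (2 * eps) })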
What are the least squares expectation and variance, and the asymptotic distribution?
E(βˆ) = β, var(βˆ) = σ²(X^T X)^{-1}
βˆ ≈ N(β, σ²(X^T X)^{-1})
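A small simulation sketch in R illustrating the mean and variance formulas (all names illustrative):
  set.seed(1)
  n <- 50; sigma <- 2
  X <- matrix(rnorm(n * 2), n, 2)
  beta <- c(1, -1)
  B <- replicate(5000, {
    Y <- drop(X %*% beta) + sigma * rnorm(n)
    drop(solve(t(X) %*% X, t(X) %*% Y))
  })
  rowMeans(B)                   # ~ beta (unbiased)
  var(t(B))                     # ~ sigma^2 (X^T X)^{-1}
  sigma^2 * solve(t(X) %*% X)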
What are the lasso coefficients for orthogonal X?
βˆ_j^lasso = sgn(βˆ_j^ls)(|βˆ_j^ls| − λ)_+ (soft thresholding)
i.e. βˆ_j^lasso = 0 if |βˆ_j^ls| ≤ λ
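Soft thresholding in R (names illustrative):
  soft <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
  soft(c(-3, -0.5, 0.2, 2), lambda = 1)   # gives -2, 0, 0, 1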
What is principal components regression?
Regression of Y on the first k principal components of X. Similar in spirit to ridge regression, but it discards (sets to 0) the low-variance directions instead of shrinking them.
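A minimal PCR sketch in R using prcomp, keeping the first k components (simulated data):
  set.seed(1)
  n <- 100
  X <- matrix(rnorm(n * 5), n, 5)
  Y <- drop(X %*% c(2, -1, 0, 0, 0)) + rnorm(n)
  k <- 2
  Z <- prcomp(X)$x[, 1:k]   # scores on the first k principal components
  pcr_fit <- lm(Y ~ Z)      # regress Y on the retained components only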
What is least angle regression (LAR)?
A method for variable selection with an algorithm similar to forward stepwise regression:
- standardise the predictors and start with all coefficients at 0
- find the predictor x_j most correlated with the residual
- move β_j towards its least squares value until another predictor has as much correlation with the current residual
- then move the coefficients jointly in the equiangular ("least angle") direction, adding predictors as they catch up, until all predictors are in the model
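A minimal sketch assuming the lars package (data simulated):
  library(lars)   # assumes lars is installed
  set.seed(1)
  X <- matrix(rnorm(100 * 5), 100, 5)
  Y <- drop(X %*% c(3, -2, 0, 0, 0)) + rnorm(100)
  fit <- lars(X, Y, type = "lar")
  coef(fit)   # coefficient path: one row per step; variables enter one at a time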
How to carry out a stepwise regression (backward and forward)?
Forward: start from the empty model and repeatedly add the variable that most improves the fit (e.g. largest F-statistic, or biggest drop in AIC) until no addition helps. Backward: start from the full model and repeatedly drop the least significant variable (e.g. smallest z-score) until all remaining variables are significant.
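Both directions via R's step(), which uses AIC rather than z-scores (simulated data):
  set.seed(1)
  d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
  d$y <- 1 + 2 * d$x1 + rnorm(100)
  full <- lm(y ~ x1 + x2 + x3, data = d)
  step(full, direction = "backward")                 # drop variables one by one
  step(lm(y ~ 1, data = d), scope = formula(full),
       direction = "forward")                        # add variables one by one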
Compare Lasso and LAR
The purpose is the same, and a small modification of the LAR algorithm computes the entire lasso path; LAR is more computationally efficient (roughly the cost of a single least squares fit).
Compare Ridge regression and PCR
Both identify the principal directions of X with small variance, but PCR removes those components entirely (coefficients set to 0) whereas Ridge regression smoothly shrinks their coefficients.
Mention diagnostic plots to evaluate fit
- residuals vs fitted values (detects non-linearity)
- normal Q-Q plot of the residuals (checks the normality assumption)
- scale-location plot, sqrt(|standardised residuals|) vs fitted values (checks homoscedasticity)
- residuals vs leverage, with Cook's distance (flags influential observations)
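In R, plot() on an lm fit produces exactly these four diagnostics:
  fit <- lm(dist ~ speed, data = cars)   # cars is a built-in dataset
  par(mfrow = c(2, 2))
  plot(fit)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage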