Parameter Estimation Flashcards

(207 cards)

1
Q

Define linear regression model.

A

A model that expresses the mean of a response as a linear function of predictors: y = Xβ + ε.
Intuition: a straight-line/plane relationship plus noise.
Example: Sales = β0 + β1(YouTube) + β2(Facebook) + ε.

2
Q

Define response variable (y).

A

The outcome you want to explain or predict; treated as random before sampling, then observed/fixed in the dataset.
Intuition: what you care about.
Example: sales (thousands of units).

3
Q

Define predictor / covariate (x).

A

An input variable used to explain/predict y; columns of X (besides the intercept).
Intuition: knobs you measure.
Example: YouTube budget.

4
Q

Define parameter (β).

A

Unknown population quantity describing the relationship between predictors and mean response.
Intuition: the ‘true’ intercept/slopes.
Example: β1 = change in mean sales per $1k YouTube.

5
Q

Define error term (ε).

A

Random deviation of an observation from its conditional mean: y = E[y|x] + ε.
Intuition: all unmodeled factors + noise.
Example: random market effects not in ad budgets.

6
Q

Define residual (e_i or ε̂_i).

A

Observed deviation from fitted line/surface: e_i = y_i − ŷ_i.
Intuition: ‘leftover’ after fitting.
Example: actual sales minus predicted sales for company i.

7
Q

Define fitted value (ŷ_i).

A

Model’s predicted mean response at x_i using estimated coefficients: ŷ_i = x_i^T β̂.
Intuition: point on fitted line/surface.
Example: predicted sales for given budgets.

8
Q

Define least squares.

A

Estimation method choosing β̂ to minimize RSS = Σ (y_i − ŷ_i)^2.
Intuition: make residuals small overall.
Example: pick the line with smallest total squared vertical distances.

9
Q

Define residual sum of squares (RSS).

A

RSS = (y − Xβ)^T(y − Xβ) = Σ e_i^2.
Intuition: total squared error of fit.
Example: objective function minimized by OLS.

10
Q

Define design matrix (X).

A

Matrix whose rows are observations and columns are predictors (first column often all 1s for intercept).
Intuition: table of inputs.
Example: [1, YouTube, Facebook, Newspaper].

11
Q

Define intercept.

A

β0; expected response when all predictors equal 0 (if 0 is meaningful).
Intuition: baseline level.
Example: predicted sales when ad budgets are $0k.

12
Q

Define slope / coefficient.

A

βj; change in E[y|x] for a one-unit increase in x_j holding other predictors fixed.
Intuition: partial effect.
Example: change in sales per $1k Facebook, holding YouTube/newspaper fixed.

13
Q

Define simple linear regression (SLR).

A

Regression with one predictor: y_i = β0 + β1 x_i + ε_i.
Intuition: fit a line in 2D.
Example: turtle_rating vs income.

14
Q

Define multiple linear regression (MLR).

A

Regression with multiple predictors: y = Xβ + ε.
Intuition: fit a plane/hyperplane.
Example: sales ~ YouTube + Facebook + newspaper.

15
Q

Define overdetermined system.

A

More equations than unknowns (n observations > p+1 parameters), so y = Xβ typically has no exact solution.
Intuition: can’t hit every point exactly.
Example: 200 companies but only 4 parameters.

16
Q

Define column space of X (Col(X)).

A

Set of all linear combinations of columns of X.
Intuition: all vectors you can represent as Xβ.
Example: all possible fitted value vectors ŷ.

17
Q

Define projection.

A

Mapping a vector to the closest vector in a subspace (in least squares: project y onto Col(X)).
Intuition: ‘shadow’ of y on the model space.
Example: ŷ is the projection of y onto Col(X).

18
Q

Define orthogonality.

A

Two vectors are orthogonal if their dot product is 0.
Intuition: 90° angle; no linear association in that geometry.
Example: residual vector is orthogonal to Col(X) at the OLS solution.

19
Q

Define normal equations.

A

Equations X^T X β̂ = X^T y that characterize the OLS solution when X^T X is invertible.
Intuition: set derivative to zero.
Example: solve for β̂ via linear system.
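
The normal equations can be checked numerically. A minimal NumPy sketch with made-up data (the course code shown elsewhere in this deck is R; the data here are hypothetical, and a linear solver is used rather than an explicit inverse):

```python
import numpy as np

# Hypothetical toy data: intercept column plus one predictor.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Normal equations: X^T X beta_hat = X^T y, solved as a linear system.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

`np.linalg.lstsq(X, y)` returns the same coefficients via a more stable route.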

20
Q

Define hat matrix (H).

A

H = X(X^T X)^{-1}X^T; maps y to fitted values: ŷ = Hy.
Intuition: linear ‘smoother’ putting the ‘hat’ on y.
Example: compute leverage from diag(H).
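
The hat-matrix facts on this card are easy to verify directly. A NumPy sketch with invented numbers (in practice H is never formed explicitly for large n):

```python
import numpy as np

# Hypothetical toy design: intercept plus one predictor, n = 4.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 2.5],
              [1.0, 4.0]])
y = np.array([1.0, 2.0, 2.5, 4.5])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # n x n, puts the 'hat' on y
y_hat = H @ y                          # fitted values
leverages = np.diag(H)                 # h_ii, one per observation

# H is symmetric and idempotent, and trace(H) = p + 1 (here 2).
```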

21
Q

Define leverage (h_ii).

A

Diagonal element of H; measures how extreme x_i is in predictor space.
Intuition: unusual x gives high leverage.
Example: a company with very large YouTube/Facebook spend.

22
Q

Define identity matrix (I).

A

Square matrix with 1s on diagonal, 0 elsewhere; acts like 1 in matrix multiplication.
Intuition: ‘do nothing’ operator.
Example: (X^T X)^{-1}(X^T X) = I.

23
Q

Define transpose.

A

Operation swapping rows/columns: (A^T)_{ij} = A_{ji}.
Intuition: flip across diagonal.
Example: (X^T X) is symmetric.

24
Q

Define symmetric matrix.

A

A matrix A such that A^T = A.
Intuition: mirror across diagonal.
Example: X^T X is always symmetric.

25
Define quadratic form.
Expression v^T A v producing a scalar. Intuition: generalized 'squared length'. Example: (y − Xβ)^T(y − Xβ).
26
Define 2-norm.
||v||_2 = sqrt(v^T v). Intuition: Euclidean length. Example: minimizing RSS is minimizing ||y − Xβ||_2^2.
27
Define assumption: mean-zero errors.
E[ε_i] = 0 for all i. Intuition: errors centered at zero; no systematic shift. Example: residuals should be centered around 0.
28
Define assumption: linearity of mean.
E[y_i] = x_i^T β (equivalently E[y|x] is linear in predictors). Intuition: straight-line mean trend. Example: expected sales increases linearly with budgets (given model).
29
Define assumption: uncorrelated errors.
Cov(ε_i, ε_j) = 0 for i≠j. Intuition: noise of one observation doesn’t predict another. Example: companies’ unexplained shocks are independent-ish.
30
Define assumption: constant variance (homoskedasticity).
Var(ε_i)=σ^2 constant across i. Intuition: same noise level for all x. Example: sales variability doesn’t grow with ad budget.
31
Define assumption: full rank / invertibility.
X has full column rank so X^T X is invertible. Intuition: no perfect multicollinearity. Example: YouTube not an exact linear combo of Facebook/newspaper.
32
Define Gauss–Markov theorem.
Under mean-zero, linearity, uncorrelated errors with constant variance, and full rank, OLS is BLUE: best (minimum variance) among unbiased linear estimators. Intuition: OLS is optimal in a specific class. Example: among unbiased linear estimators of β, OLS variance is smallest.
33
Define BLUE.
Best Linear Unbiased Estimator. Intuition: within linear unbiased estimators, you can’t beat OLS variance. Example: OLS beats any other linear unbiased estimator.
34
Define bias.
E[estimator] − true parameter. Intuition: systematic error. Example: biased slope consistently overshoots β1.
35
Define variance of estimator.
How much an estimator varies across repeated samples. Intuition: stability. Example: high-variance β̂ jumps around between samples.
36
Define maximum likelihood estimation (MLE).
Chooses parameter values maximizing the likelihood (joint density of data as a function of parameters). Intuition: make observed data most 'probable' under the model. Example: with normal errors, MLE for β equals OLS.
37
Define normality assumption (for MLE equivalence).
Assume ε_i ~ Normal(0, σ^2) i.i.d. Intuition: bell-shaped noise. Example: then maximizing log-likelihood ↔ minimizing RSS.
38
Define log-likelihood.
Log of likelihood; easier to maximize and has same maximizer. Intuition: turns products into sums. Example: normal log-likelihood contains −(1/2σ^2)Σ(y_i−μ_i)^2.
39
Define i.i.d.
Independent and identically distributed. Intuition: same distribution, no dependence. Example: ε_i are independent with same σ^2.
40
Define fitting vs predicting vs explaining.
Fit: estimate β; explain: interpret β; predict: compute ŷ for new x. Intuition: three distinct goals. Example: budgets → sales interpretation vs forecast.
41
Define double dipping.
Using the same data for exploration and formal inference can inflate false positives. Intuition: 'peeking' then testing. Example: exploring correlations then claiming significance on same dataset.
42
Define outlier (IQR rule).
Point beyond Q3+1.5·IQR or below Q1−1.5·IQR. Intuition: unusually large/small value. Example: very high newspaper budget flagged in boxplot.
43
Define correlation.
Measure of linear association between two variables (−1 to 1). Intuition: strength of straight-line relationship. Example: corr(sales, YouTube) ≈ 0.78 in marketing data.
44
Define scatterplot diagnostic (EDA).
Plot pairs to see form/strength/outliers/heteroskedasticity. Intuition: look before modeling. Example: sales vs YouTube shows curvature + increasing spread.
45
What is the vertical distance (residual) in SLR?
In SLR, residual is the vertical difference between a point and the fitted line: e_i = y_i − (β̂0+β̂1 x_i). Intuition: how far up/down the point is from the line. Example: at x=10, actual y=25, predicted y=22 ⇒ residual=3.
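
The card's numbers can be reproduced with hypothetical coefficients consistent with its example (β̂0 = 2, β̂1 = 2 gives the stated prediction of 22 at x = 10):

```python
# Hypothetical coefficients chosen to match the card's example.
b0, b1 = 2.0, 2.0
x, y_obs = 10.0, 25.0
y_hat = b0 + b1 * x      # predicted value on the fitted line: 22.0
e = y_obs - y_hat        # residual = vertical distance: 3.0
```
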
46
Why square residuals in least squares?
Squaring makes deviations positive, penalizes large errors more, and yields a smooth objective with a closed-form solution. Intuition: big mistakes matter a lot. Example: residuals 2 and −2 both contribute 4.
47
What is OLS estimator (definition)?
β̂_OLS is the β that minimizes RSS = (y−Xβ)^T(y−Xβ). Intuition: best-fitting hyperplane in squared-error sense. Example: computed by lm() in R.
48
What is being minimized in OLS (matrix view)?
The squared 2-norm of the residual vector: ||y − Xβ||_2^2. Intuition: shortest distance from y to Col(X). Example: distance from y to its projection ŷ.
49
What is the geometric meaning of ŷ?
ŷ is the orthogonal projection of y onto Col(X). Intuition: closest point in model space. Example: ŷ lies in span of X columns.
50
What is the geometric meaning of the residual vector e?
e = y − ŷ is orthogonal to Col(X): X^T e = 0 at OLS solution. Intuition: leftover is perpendicular. Example: normal equations are orthogonality conditions.
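
The orthogonality condition X^T e = 0 is easy to check on simulated data. A NumPy sketch (coefficients and noise level are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# Residuals are orthogonal to every column of X (the normal equations).
orth = X.T @ e
```
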
51
What does 'full column rank' mean?
Columns of X are linearly independent; no column is an exact linear combo of others. Intuition: each predictor adds unique information. Example: not having Facebook = 2*YouTube exactly.
52
Why can X^T X be non-invertible?
If predictors are perfectly collinear or p+1 > n, X^T X is singular. Intuition: cannot uniquely identify β. Example: duplicate predictor column.
53
What does lm() return (conceptually)?
Least-squares estimates (β̂), fitted values ŷ, residuals e, and diagnostics based on the OLS fit. Intuition: a complete OLS summary. Example: summary(lm(...)) gives estimates, SEs, t, p.
54
What is the role of QR factorization in lm()?
Numerically stable method to compute OLS without explicitly inverting X^T X. Intuition: avoid unstable matrix inversion. Example: lm() typically uses QR decomposition.
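
The QR route can be sketched in NumPy as a simplified stand-in for what lm() does internally (simulated data; lm()'s actual implementation has more machinery):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 2 + 3 * X[:, 1] + rng.normal(size=30)

# QR route: X = QR with R upper triangular; solve R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Textbook closed form -- fine here, but less stable when X is ill-conditioned.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
```
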
55
Why treat y as fixed after sampling?
Once observed, y values are realizations; estimation conditions on observed data. Intuition: you can’t change the sample you got. Example: OLS uses fixed y and X to estimate β.
56
What does 'linear' mean in 'linear regression'?
Linear in parameters β (not necessarily linear in x after transformations). Intuition: β enter as a linear combination. Example: y = β0 + β1 log(x) is linear in β.
57
What is the systematic part Xβ?
The mean structure explained by predictors. Intuition: what the model tries to capture. Example: predicted sales from budgets.
58
What is the random part ε?
Unexplained variability not captured by predictors. Intuition: noise + omitted factors. Example: competitor actions, seasonality.
59
What does 'best' mean in Gauss–Markov?
Minimum variance among all linear unbiased estimators. Intuition: most precise unbiased linear β̂. Example: OLS beats other linear unbiased β estimators.
60
Why doesn't the Gauss–Markov theorem require normality?
BLUE result uses only mean/variance/covariance assumptions, not distribution shape. Intuition: geometry + second moments. Example: OLS is BLUE even with non-normal errors.
61
What is normality mainly used for?
Exact finite-sample t/F inference and MLE equivalence; large-sample inference can rely on CLT. Intuition: distributional convenience. Example: normal residuals justify t-tests exactly.
62
What does 'projection' imply about closeness?
ŷ minimizes ||y − v|| over all v in Col(X). Intuition: closest point in that subspace. Example: no other fitted-value vector has smaller RSS.
63
What does the hat matrix do?
Maps observed responses to fitted values: ŷ = Hy; residuals are (I−H)y. Intuition: linear operator for fitting. Example: compute fitted values via matrix multiplication.
64
Why is it called the 'hat' matrix?
Because it puts a 'hat' on y: y → ŷ. Intuition: 'hats' denote estimates. Example: ŷ = Hy.
65
What is a 'tall' matrix?
More rows than columns (n > p+1). Intuition: more data points than parameters. Example: 200×4 design matrix.
66
What does it mean that a system is overdetermined?
Too many equations to satisfy exactly; must approximate. Intuition: can’t pass through all points. Example: pick best plane instead of exact match.
67
Distinguish: residual vs error.
Residual e_i = y_i − ŷ_i is computed from data; error ε_i is unobserved random variable in the true model. Intuition: residual is what you see; error is what generated the data. Example: ε_i unknown even after fitting; e_i is output from lm().
68
Distinguish: fitted values vs predictions.
Fitted values ŷ_i are predictions for observed x_i; prediction for new x_* uses same form but with new inputs. Intuition: in-sample vs out-of-sample. Example: predict(lmfit, newdata=...).
69
Distinguish: linearity in parameters vs linear in x.
Model is linear if it’s a linear combo of β’s; x can be transformed. Intuition: β enter without being multiplied together. Example: y=β0+β1 x+β2 x^2 is linear in β.
70
Distinguish: uncorrelated errors vs independent errors.
Uncorrelated: Cov(ε_i,ε_j)=0; independent: stronger, implies uncorrelated (for finite variance). Intuition: independence is a bigger assumption. Example: Gauss–Markov needs uncorrelated; MLE derivation used independence.
71
Distinguish: constant variance vs normality.
Homoskedasticity is about Var(ε|x); normality is about distribution shape. Intuition: spread vs shape. Example: errors can be non-normal but homoskedastic.
72
Distinguish: unbiased vs efficient.
Unbiased means correct on average; efficient means smallest variance among a class. Intuition: accuracy vs precision. Example: OLS is unbiased and (within linear unbiased) efficient.
73
Distinguish: OLS solution existence vs uniqueness.
Existence: minimizer always exists; uniqueness requires full column rank (invertible X^T X). Intuition: you can minimize, but might be many solutions. Example: perfect collinearity → multiple β give same ŷ.
74
Distinguish: Col(X) vs Row(X).
Col(X) is span of columns (possible ŷ); Row(X) is span of rows. Intuition: fitted values live in Col(X). Example: projection is onto Col(X), not Row(X).
75
Given the model/data, Compute residual e_i.
Steps: 1) Fit model to get β̂. 2) Compute ŷ_i = x_i^T β̂. 3) e_i = y_i − ŷ_i. Formula: e = y − ŷ. When: after fitting to assess model fit. Mistake: confusing e_i with ε_i (unobserved).
76
Given the model/data, Compute fitted value ŷ_i.
Steps: 1) Write x_i row incl. intercept (1, x_i1,...,x_ip). 2) Multiply by β̂. Formula: ŷ_i = x_i^T β̂. When: prediction at observed x. Mistake: forgetting intercept or dummy coding.
77
Given the model/data, Write RSS in matrix form.
RSS(β) = (y − Xβ)^T (y − Xβ) = ||y − Xβ||_2^2. When: deriving OLS. Mistake: treating as scalar without respecting matrix order.
78
Given the model/data, Derive normal equations (result).
Set gradient to 0: ∂/∂β RSS = −2X^T y + 2X^T Xβ = 0 ⇒ X^T Xβ = X^T y. When: computing β̂. Mistake: dropping transposes or mixing order.
79
Given the model/data, Compute OLS β̂ (closed form).
If X^T X invertible: β̂ = (X^T X)^{-1} X^T y. When: theoretical derivations; software uses numerically stable methods. Mistake: explicitly inverting in code unnecessarily.
80
Given the model/data, Compute hat matrix H.
H = X (X^T X)^{-1} X^T. Then ŷ = Hy; e = (I − H)y. When: leverage/influence theory. Mistake: H is n×n, not (p+1)×(p+1).
81
Given the model/data, Compute leverage h_ii.
h_ii = (H)_{ii}. Steps: compute H (or use lm.influence()) and take diagonal. When: identify high-leverage points. Mistake: confusing leverage with residual size.
82
Given the model/data, Show X^T X is symmetric.
(X^T X)^T = X^T (X^T)^T = X^T X. When: matrix algebra in derivations. Mistake: forgetting reverse order when transposing products.
83
Given the model/data, Derivative: if y = Xv, compute ∂y/∂v.
Result: ∂(Xv)/∂v = X. When: gradient derivations. Mistake: mixing scalar and vector derivatives.
84
Given the model/data, Derivative: quadratic form v^T (X^T X) v.
Result: ∂/∂v [v^T (X^T X) v] = 2(X^T X)v. When: derivative of β^T X^T X β. Mistake: missing factor of 2.
85
Given the model/data, MLE log-likelihood for normal errors (up to constants).
ℓ(β) = const − (1/(2σ^2)) Σ (y_i − μ_i)^2 where μ_i = x_i^T β. Maximizing ℓ ↔ minimizing RSS. Mistake: forgetting negative sign or constants.
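
The equivalence can be checked numerically: under normal errors the negative log-likelihood in β is proportional to RSS, so the OLS fit is the likelihood maximizer. A NumPy sketch with simulated data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = 1 + 2 * X[:, 1] + rng.normal(size=40)

# Up to constants, -l(beta) = RSS(beta) / (2 sigma^2) for fixed sigma^2.
def neg_loglik(beta, sigma2=1.0):
    r = y - X @ beta
    return (r @ r) / (2 * sigma2)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Perturbing beta away from the OLS solution can only increase -l(beta).
perturbed = [neg_loglik(beta_ols + d)
             for d in ([0.1, 0.0], [0.0, -0.1], [0.05, 0.05])]
```
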
86
Given the model/data, R: fit simple linear regression.
Code: fit <- lm(y ~ x, data=df) Extract: coef(fit), fitted(fit), resid(fit), summary(fit). Mistake: forgetting data= or variable names.
87
Given the model/data, R: fit multiple regression.
Code: fit <- lm(y ~ x1 + x2 + x3, data=df) Mistake: using * when you meant + (adds interaction).
88
Given the model/data, R: get coefficient table.
Code: summary(fit)$coefficients Gives: Estimate, Std. Error, t value, Pr(>|t|). Mistake: interpreting p-value as effect size.
89
Given the model/data, R: predicted values for new data.
Code: predict(fit, newdata=newdf) Mistake: newdf must have same predictor names and factor levels.
90
Given the model/data, R: check missing values quickly.
Code: sum(is.na(df)) Mistake: NA vs other missing codes (e.g., 999).
91
Given the model/data, R: boxplot + outlier flag (IQR rule).
Code: boxplot(df$var) Outliers: boxplot.stats(df$var)$out Mistake: deleting outliers automatically without investigation.
92
Given the model/data, R: correlation matrix.
Code: cor(df) Mistake: using correlation to imply causation.
93
Given the model/data, R: pairs plot.
Code: pairs(df) Mistake: concluding linearity without checking curvature/variance patterns.
94
Derivation step: Expand RSS.
RSS = (y−Xβ)^T(y−Xβ) = y^T y − y^T Xβ − β^T X^T y + β^T X^T Xβ. Because middle terms are scalars, y^T Xβ = (β^T X^T y)^T = β^T X^T y, so combine to −2β^T X^T y.
95
Derivation step: Take gradient.
∂/∂β [y^T y] = 0; ∂/∂β [−2β^T X^T y] = −2X^T y; ∂/∂β [β^T X^T Xβ] = 2X^T Xβ.
96
Derivation step: Solve.
Set gradient to 0: −2X^T y + 2X^T Xβ = 0 ⇒ X^T Xβ = X^T y ⇒ β̂ = (X^T X)^{-1}X^T y (if invertible).
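
The derivation in cards 94–96 can be verified numerically: at β̂ the gradient of RSS vanishes. A NumPy sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(25), rng.normal(size=25)])
y = rng.normal(size=25)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient from card 95 evaluated at beta_hat: -2 X^T y + 2 X^T X beta_hat.
grad = -2 * X.T @ y + 2 * (X.T @ X) @ beta_hat
```
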
97
True/False: Residuals are the same thing as the model errors ε_i.
False. Residuals e_i are computed from fitted model: e_i = y_i − ŷ_i. Errors ε_i are unobserved random variables in the true data-generating process. Exam trick: they use ε and ê interchangeably—watch the hat.
98
If X^T X is invertible, should you compute β̂ in R by solve(t(X)%*%X)%*%t(X)%*%y?
Usually no. It's numerically less stable than QR-based methods (what lm() uses). Correct: use lm() (or qr.solve / crossprod with care). Exam trick: asks for 'the formula' vs 'recommended computation'.
99
True/False: Least squares requires predictors to be normally distributed.
False. OLS estimation does not require predictors to be normal. Normality is about errors (and mostly for inference). Exam trick: swaps 'predictors' with 'errors'.
100
True/False: Gauss–Markov theorem assumes normal errors.
False. Gauss–Markov uses mean-zero, linearity, uncorrelated errors with constant variance, and full rank. Exam trick: confuses BLUE result with t-test assumptions.
101
A student says: 'Because MLE = OLS, OLS is always optimal.' What’s wrong?
MLE=OLS only under added assumptions (e.g., normal i.i.d. errors). 'Optimal' depends on criterion/class. With heteroskedasticity or correlation, OLS isn’t BLUE and other estimators may be better. Exam trick: overgeneralizing a conditional result.
102
True/False: In an overdetermined system y=Xβ, there is no exact solution, so OLS finds an approximate β that makes Xβ closest to y.
True (typical case). OLS chooses β minimizing ||y−Xβ||^2, i.e., closest point in Col(X). Exam trick: they might say OLS 'solves' y=Xβ exactly—wrong.
103
A student interprets β0 as 'sales when budgets are zero', but zero budgets may be outside data range. What’s the issue?
Intercept interpretation may be extrapolation or meaningless if x=0 isn’t plausible/observed. Exam trick: intercept interpretation requires context.
104
True/False: High leverage means a point has a large residual.
False. Leverage depends on x-values (geometry), not y. A point can have high leverage but small residual (fits the model well). Exam trick: conflates leverage with outlier in y.
105
True/False: Minimizing Σ|e_i| gives the same solution as least squares.
False. L1 loss (least absolute deviations) yields a different estimator (median-based, not closed form like OLS). Exam trick: assumes all loss functions equivalent.
106
If residuals look non-normal, can OLS β̂ still be unbiased?
Yes—unbiasedness depends on mean-zero and exogeneity-type assumptions, not normality. Non-normality mainly affects exact small-sample inference. Exam trick: treats normality as required for unbiasedness.
107
True/False: Because X^T X is symmetric, it must be invertible.
False. Symmetric does not imply invertible; singular symmetric matrices exist (det=0). Exam trick: 'symmetric' ≠ 'full rank'.
108
True/False: The middle terms in RSS expansion are always equal because matrix multiplication is commutative.
False reasoning. They are equal because each is a scalar and equal to its transpose, not because multiplication commutes. Exam trick: tries to bait you into saying matrices commute.
109
A question says: 'The least squares line minimizes squared *horizontal* distances.' Correct it.
OLS minimizes squared *vertical* distances in y-direction (residuals), not horizontal x distances. Exam trick: swaps axes.
110
True/False: If you add a predictor perfectly correlated with an existing predictor, β̂ is still unique.
False. Perfect collinearity makes X^T X singular; β̂ is not uniquely identified. Exam trick: 'more predictors always better'.
111
A student says: 'Correlation between y and x proves x causes y.' What's the fix?
Correlation shows association, not causation. Causal claims require design/assumptions beyond regression. Exam trick: causal language in a regression course.
112
Pitfall #1: True/False: ŷ always equals y for each observation.
False. Only true if the model fits perfectly (RSS=0) or if p+1=n and X full rank (interpolation), which is not typical. Watch for: confusing fitted values with observed values.
113
Pitfall #2: You see 'X is tall' (n>p+1). Does that guarantee X^T X is invertible?
No. Tall helps, but you also need columns linearly independent (no perfect multicollinearity). Watch for: 'tall' ≠ 'full rank'.
114
Pitfall #3: True/False: If errors are uncorrelated, they must be independent.
False. Independence ⇒ uncorrelated (with finite variance), but uncorrelated does not imply independent. Watch for: reversing implication.
115
Pitfall #4: A claim: 'Because OLS minimizes RSS, it minimizes the variance of residuals.' What’s wrong?
OLS minimizes RSS for given X and y, not the sampling variance of residuals; and residual variance depends on σ^2 and model fit. Watch for: mixing optimization objective with sampling properties.
116
Pitfall #5: True/False: The projection ŷ is the closest vector to y in all of R^n.
False. It’s closest among vectors in Col(X) only. Watch for: forgetting 'within the subspace'.
117
Pitfall #6: A question says: 'H is (p+1)×(p+1)'. Correct it.
H is n×n because it maps an n-vector y to an n-vector ŷ. Watch for: mixing parameter dimension with observation dimension.
118
Pitfall #7: True/False: If you re-scale a predictor (e.g., dollars to thousands), β̂ changes but fitted values ŷ stay the same (if you adjust β accordingly).
True. Coefficient units change; predictions remain invariant to linear re-scaling. Watch for: interpreting β without units.
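
This invariance is quick to demonstrate. A NumPy sketch with hypothetical budgets expressed in dollars vs thousands of dollars:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
dollars = rng.uniform(1000, 50000, size=n)                   # budget in $
y = 5 + 0.0004 * dollars + rng.normal(size=n)

X_dollars = np.column_stack([np.ones(n), dollars])
X_thousands = np.column_stack([np.ones(n), dollars / 1000])  # budget in $k

b_d, *_ = np.linalg.lstsq(X_dollars, y, rcond=None)
b_k, *_ = np.linalg.lstsq(X_thousands, y, rcond=None)

# Slope rescales by the factor 1000; fitted values are unchanged.
```
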
119
Pitfall #8: True/False: ŷ always equals y for each observation.
False. Only true if the model fits perfectly (RSS=0) or if p+1=n and X full rank (interpolation), which is not typical. Watch for: confusing fitted values with observed values.
120
Pitfall #9: You see 'X is tall' (n>p+1). Does that guarantee X^T X is invertible?
No. Tall helps, but you also need columns linearly independent (no perfect multicollinearity). Watch for: 'tall' ≠ 'full rank'.
121
Pitfall #10: True/False: If errors are uncorrelated, they must be independent.
False. Independence ⇒ uncorrelated (with finite variance), but uncorrelated does not imply independent. Watch for: reversing implication.
122
Pitfall #11: A claim: 'Because OLS minimizes RSS, it minimizes the variance of residuals.' What’s wrong?
OLS minimizes RSS for given X and y, not the sampling variance of residuals; and residual variance depends on σ^2 and model fit. Watch for: mixing optimization objective with sampling properties.
123
Pitfall #12: True/False: The projection ŷ is the closest vector to y in all of R^n.
False. It’s closest among vectors in Col(X) only. Watch for: forgetting 'within the subspace'.
124
Pitfall #13: A question says: 'H is (p+1)×(p+1)'. Correct it.
H is n×n because it maps an n-vector y to an n-vector ŷ. Watch for: mixing parameter dimension with observation dimension.
125
Pitfall #14: True/False: If you re-scale a predictor (e.g., dollars to thousands), β̂ changes but fitted values ŷ stay the same (if you adjust β accordingly).
True. Coefficient units change; predictions remain invariant to linear re-scaling. Watch for: interpreting β without units.
126
Pitfall #15: True/False: ŷ always equals y for each observation.
False. Only true if the model fits perfectly (RSS=0) or if p+1=n and X full rank (interpolation), which is not typical. Watch for: confusing fitted values with observed values.
127
Pitfall #16: You see 'X is tall' (n>p+1). Does that guarantee X^T X is invertible?
No. Tall helps, but you also need columns linearly independent (no perfect multicollinearity). Watch for: 'tall' ≠ 'full rank'.
128
Pitfall #17: True/False: If errors are uncorrelated, they must be independent.
False. Independence ⇒ uncorrelated (with finite variance), but uncorrelated does not imply independent. Watch for: reversing implication.
129
Pitfall #18: A claim: 'Because OLS minimizes RSS, it minimizes the variance of residuals.' What’s wrong?
OLS minimizes RSS for given X and y, not the sampling variance of residuals; and residual variance depends on σ^2 and model fit. Watch for: mixing optimization objective with sampling properties.
130
Pitfall #19: True/False: The projection ŷ is the closest vector to y in all of R^n.
False. It’s closest among vectors in Col(X) only. Watch for: forgetting 'within the subspace'.
131
Pitfall #20: A question says: 'H is (p+1)×(p+1)'. Correct it.
H is n×n because it maps an n-vector y to an n-vector ŷ. Watch for: mixing parameter dimension with observation dimension.
132
Pitfall #21: True/False: If you re-scale a predictor (e.g., dollars to thousands), β̂ changes but fitted values ŷ stay the same (if you adjust β accordingly).
True. Coefficient units change; predictions remain invariant to linear re-scaling. Watch for: interpreting β without units.
133
Pitfall #22: True/False: ŷ always equals y for each observation.
False. Only true if the model fits perfectly (RSS=0) or if p+1=n and X full rank (interpolation), which is not typical. Watch for: confusing fitted values with observed values.
134
Pitfall #23: You see 'X is tall' (n>p+1). Does that guarantee X^T X is invertible?
No. Tall helps, but you also need columns linearly independent (no perfect multicollinearity). Watch for: 'tall' ≠ 'full rank'.
135
Pitfall #24: True/False: If errors are uncorrelated, they must be independent.
False. Independence ⇒ uncorrelated (with finite variance), but uncorrelated does not imply independent. Watch for: reversing implication.
136
Pitfall #25: A claim: 'Because OLS minimizes RSS, it minimizes the variance of residuals.' What’s wrong?
OLS minimizes RSS for given X and y, not the sampling variance of residuals; and residual variance depends on σ^2 and model fit. Watch for: mixing optimization objective with sampling properties.
137
Pitfall #26: True/False: The projection ŷ is the closest vector to y in all of R^n.
False. It’s closest among vectors in Col(X) only. Watch for: forgetting 'within the subspace'.
138
Pitfall #27: A question says: 'H is (p+1)×(p+1)'. Correct it.
H is n×n because it maps an n-vector y to an n-vector ŷ. Watch for: mixing parameter dimension with observation dimension.
139
Pitfall #28: True/False: If you re-scale a predictor (e.g., dollars to thousands), β̂ changes but fitted values ŷ stay the same (if you adjust β accordingly).
True. Coefficient units change; predictions remain invariant to linear re-scaling. Watch for: interpreting β without units.
140
Pitfall #29: True/False: ŷ always equals y for each observation.
False. Only true if the model fits perfectly (RSS=0) or if p+1=n and X full rank (interpolation), which is not typical. Watch for: confusing fitted values with observed values.
141
Pitfall #30: You see 'X is tall' (n>p+1). Does that guarantee X^T X is invertible?
No. Tall helps, but you also need columns linearly independent (no perfect multicollinearity). Watch for: 'tall' ≠ 'full rank'.
142
Pitfall #31: True/False: If errors are uncorrelated, they must be independent.
False. Independence ⇒ uncorrelated (with finite variance), but uncorrelated does not imply independent. Watch for: reversing implication.
143
Pitfall #32: A claim: 'Because OLS minimizes RSS, it minimizes the variance of residuals.' What’s wrong?
OLS minimizes RSS for given X and y, not the sampling variance of residuals; and residual variance depends on σ^2 and model fit. Watch for: mixing optimization objective with sampling properties.
187
Define total sum of squares (TSS).
TSS = Σ(y_i − ȳ)^2; total variability of y around its mean. Intuition: variability left with no predictors (intercept-only model). Equivalently: sample variance × (n−1).
188
Define explained sum of squares (ESS).
ESS = Σ(ŷ_i − ȳ)^2; variability explained by the regression. Intuition: gain from using model vs mean. Example: improvement over y-bar model.
189
Define residual sum of squares (RSS).
RSS = Σ(y_i − ŷ_i)^2; unexplained variability. Intuition: leftover noise. Example: vertical deviations from fitted line.
190
Define ANOVA decomposition.
TSS = ESS + RSS (for models with intercept). Intuition: total = explained + unexplained.
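The decomposition can be verified numerically. A short Python sketch (toy data; the identity holds exactly for any OLS fit that includes an intercept):

```python
# Sketch: check TSS = ESS + RSS for a simple least-squares fit with
# an intercept. Illustrative data only.

def ols_fit(x, y):
    """Return (intercept, slope) for simple least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = ols_fit(x, y)
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

tss = sum((yi - ybar) ** 2 for yi in y)
ess = sum((yh - ybar) ** 2 for yh in yhat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
assert abs(tss - (ess + rss)) < 1e-9  # total = explained + unexplained
```

Without an intercept (or with a non-OLS fit), the cross-term Σ(ŷ_i − ȳ)(y_i − ŷ_i) need not vanish and the identity can fail.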
191
Define degrees of freedom (residual).
df_resid = n − (p+1); loss from estimating parameters. Intuition: info left after fitting.
192
Define error variance σ².
Population variance of ε; noise level. Intuition: spread around true regression.
193
Define estimator of σ².
σ̂² = RSS / (n − p − 1); unbiased estimator. Intuition: average squared residual.
194
Define coefficient of determination (R²).
R² = 1 − RSS/TSS = ESS/TSS. Intuition: proportion of variability explained.
195
Define adjusted R².
Adjusted R² = 1 − [RSS/(n−p−1)] / [TSS/(n−1)]; penalizes extra predictors. Intuition: fairer comparison between models with different numbers of predictors.
196
Define non-identifiability.
Parameters cannot be uniquely estimated when XᵀX is singular. Intuition: redundant predictors.
197
Define perfect multicollinearity.
Exact linear dependence among predictors. Intuition: measuring same thing twice.
198
Define near multicollinearity.
Predictors highly correlated but not exact multiples. Intuition: unstable estimates.
199
How do you Compute TSS?
TSS = Σ(y_i − ȳ)^2. Use: total variability. Mistake: forgetting that the decomposition TSS = ESS + RSS requires an intercept in the model.
200
How do you Compute ESS?
ESS = Σ(ŷ_i − ȳ)^2. Use: explained variability.
201
How do you Compute σ̂²?
σ̂² = RSS / (n − p − 1). Use: inference. Mistake: dividing by n.
202
How do you Compute R²?
R² = 1 − RSS/TSS. Use: descriptive fit. Mistake: using for causal claims.
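Tying the last few cards together, here is a Python sketch (illustrative data) computing R² both ways and σ̂² with the correct denominator for a one-predictor fit (p = 1):

```python
# Sketch: R^2 via 1 - RSS/TSS and via ESS/TSS agree for an OLS fit
# with an intercept; sigma^2-hat divides RSS by n - p - 1, not n.

def ols_fit(x, y):
    """Return (intercept, slope) for simple least squares."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
n, p = len(x), 1
b0, b1 = ols_fit(x, y)
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / n

tss = sum((yi - ybar) ** 2 for yi in y)
ess = sum((yh - ybar) ** 2 for yh in yhat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))

r2_a = 1 - rss / tss
r2_b = ess / tss
assert abs(r2_a - r2_b) < 1e-9   # both formulas agree (with intercept)

sigma2_hat = rss / (n - p - 1)   # unbiased: divide by n - p - 1, not n
assert sigma2_hat > 0
```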
203
How do you ANOVA in R?
anova(lm_fit) returns sequential sums of squares: the predictor rows sum to ESS and the Residuals row gives RSS. Mistake: confusing which row is which.
204
True/False: High R² implies correct model.
False. Wrong functional form can still yield high R².
205
True/False: R² can decrease when adding predictors.
False. R² never decreases when a predictor is added (RSS cannot increase); use adjusted R² to compare models of different sizes.
206
Seeing NA coefficients in lm() means what?
Exact non-identifiability: that column of X is an exact linear combination of the others, so R drops it; remove the redundant predictor.
207
Opposite-signed coefficients than truth suggest?
Possible (near) multicollinearity: strongly correlated predictors can flip individual coefficient signs even when the joint predictions remain sensible.