Section 11: Multiple Linear Regression Analysis Flashcards

(107 cards)

1
Q

What does multiple linear regression model?

A

Models relationship between numeric DV & several IV (numeric or categorical).

2
Q

What is another name for multiple linear regression?

A

Multivariable linear regression (often loosely called "multivariate" regression, though strictly that term refers to models with multiple dependent variables).

3
Q

What is the multiple linear regression equation?

A

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

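The deck's commands are Stata, but the arithmetic behind this equation can be sketched in plain Python: ordinary least squares via the normal equations. Everything below (the data and variable names) is invented for illustration.

```python
# Minimal sketch (not the course's Stata workflow): fit Y = b0 + b1*X1 + b2*X2
# by ordinary least squares using only the standard library.

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    rows = [[1.0] + list(x) for x in X]   # prepend an intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Toy data generated from Y = 2 + 3*X1 - 1*X2 with no noise,
# so OLS should recover the coefficients exactly.
X = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
b0, b1, b2 = ols(X, y)
```

Because the toy outcome is an exact linear function of the predictors, the fitted β₀, β₁, β₂ come back as 2, 3, and −1.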
4
Q

What does the fitted line represent in multiple regression?

A

The best-fit plane (or hyperplane) through a multi-dimensional data space (a multi-dimensional scatter plot).

5
Q

What can the independent variables in a multiple regression model be?

A

Numeric or categorical.

6
Q

What type of dependent variable does MLR require?

multiple linear regression

A

Numeric (continuous) dependent variable.

7
Q

How many beta coefficients (β) are estimated in multiple linear regression?

A

One for each predictor variable.

8
Q

What does each beta coefficient represent?

A

Estimated change (increase/decrease) in the DV (Y) for each 1-unit increase in the predictor variable,
* assuming all other predictors are held constant
* i.e., controlling/adjusting for the effects of the other predictor variables.

9
Q

What does “controlling for the effects of other predictors” mean?

A

Holding all other variables constant while estimating the unique effect of one predictor on the outcome.

10
Q

What does β₀ represent in the MLR equation?

A

Intercept: the predicted value of Y when all X variables are equal to 0.

11
Q

What does Y represent in the MLR equation?

A

Dependent (outcome) variable - numeric variable being predicted or explained.

12
Q

What do the coefficients β₁, β₂, …, βₚ represent in the MLR equation?

linear reg

A

Regression coefficients: the estimated change in Y for a 1-unit increase in each X, controlling for all other variables.

13
Q

What does the term ε represent?

A
  • Error term (residual) - difference between observed & predicted values of 𝑌.
  • random error
14
Q

What is the practical meaning of a positive β coefficient?

A

As predictor increases, DV increases, assuming other predictors remain constant.

15
Q

What is the meaning of a negative β coefficient?

A

As the exposure (predictor) increases, the DV decreases, holding other predictors constant.

16
Q

What does it mean when a β coefficient is 0 or not statistically significant?

A

No evidence of an association between that predictor and the outcome after adjusting for the others.

17
Q

What is the form of the simple linear regression equation?

A

Y = β₀ + β₁X + ε

18
Q

How does multiple regression extend the simple linear regression equation?
Y = β₀ + β₁X + ε

A

By adding more predictors (β₂X₂, β₃X₃, etc.) to account for multiple influences on 𝑌

19
Q

What is the Stata command for a multiple linear regression?

A

regress dependent independent1 independent2 independent3

20
Q

In a model with categorical predictors, how are dummy variables handled in Stata?

A

Prefix the variable with i. (e.g., i.smoker); this adds one β for each non-baseline category.

21
Q

Example: Interpret β = 1.07 for weight (p < 0.001).

A

For every 1-kg increase in weight, systolic BP increases on average by 1.07 units, adjusting for height, age, sex, and smoking.

22
Q

Example: Interpret β = −0.89 for height (p < 0.001).

A

For every 1-cm increase in height, systolic BP decreases by 0.89 units, controlling for other variables.

23
Q

What are the two main uses of multiple regression?

A
  1. Measure strength of effect that IV has on DV (controlling for confounding).
  2. Forecast effects/impacts of changes in exposure variables on outcome variable
24
Q

Confounding vs mediator vs effect modifier?

A
  • Confounder - associated with both the exposure and the outcome; distorts the exposure-outcome association (should be adjusted for).
  • Mediator - lies on the causal pathway between exposure and outcome (adjusting for it removes part of the true effect).
  • Effect modifier - the exposure-outcome association differs in strength or direction across its levels (report stratified estimates rather than simply adjusting).
25
What does R-squared (R²) represent?
% of variation in DV explained by the predictors
26
What happens when more variables are added?
R² increases, but overfitting may occur
27
What happens when we add an IV to an MLR model? | multiple linear regression
Increases the amount of explained variance in the DV
28
What is an overfit model?
A model that fits random noise because it includes too many predictors without justification.
29
When we look at a table for LR, which R2 value do we look at?
* Simple regression model → use regular R²
* Multiple regression → use adjusted R²
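The regular vs adjusted R² distinction on this card can be made concrete with a small stdlib-Python sketch (the observed and fitted values below are made up):

```python
# Sketch of R-squared vs adjusted R-squared (illustrative numbers only).

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the share of variation in Y explained by the model."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - rss / tss

def adjusted_r_squared(y, y_hat, p):
    """Penalise R^2 for the number of predictors p (n = sample size)."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y     = [10, 12, 15, 18, 20, 23]
y_hat = [11, 12, 14, 18, 21, 22]   # fitted values from some 2-predictor model
r2  = r_squared(y, y_hat)
ar2 = adjusted_r_squared(y, y_hat, p=2)
# adjusted R^2 is always <= R^2 once p >= 1, which is why it is the
# fairer statistic to report for a multiple regression model
```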
30
List the 5 key assumptions for MLR
1. **Linearity** between DV (Y) & each **numeric** IV (X).
2. No significant **outliers**.
3. **Normality** of numeric variables & residuals.
4. **Homoscedasticity** of residuals.
5. No **multicollinearity** between predictor variables.
31
How is linearity tested for MLR? | Assumption 1
* Scatter plots * Residual plots
32
How do we check that there are no significant outliers for MLR? | Assumption 2
* Linear regression is sensitive to outlier effects, so significant outliers must be excluded * Tested using residual plots
33
How is normality of numeric variables & of residuals tested for MLR? | Assumption 3
Normal probability plots
34
How is no multicollinearity between predictor variables tested for MLR? | Assumption 5
* Variance inflation factors (VIF) * Pairwise correlations
35
How can we test homoscedasticity of residuals for MLR? | Assumption 4
* Residuals should have equal variance across the regression line * Tested using residual plots or the **hettest** command
36
When does multicollinearity occur?
When 2 or more IVs are highly correlated with each other.
37
What statistic detects multicollinearity?
Variance Inflation Factor (VIF).
38
When is VIF too high?
VIF > 10 suggests multicollinearity.
39
What is tolerance and its rule?
* Tolerance = 1/VIF. * If < 0.1 → problematic collinearity.
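As a sanity check on the VIF and tolerance rules, here is a minimal Python sketch for the special case of exactly two predictors, where the auxiliary R² is simply the squared pairwise correlation (the data are invented; with more predictors, R²ⱼ comes from regressing Xⱼ on all the others):

```python
# For a model with exactly two predictors, VIF = 1 / (1 - r^2),
# where r is the Pearson correlation between the predictors.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx  = sum((a - mx) ** 2 for a in x)
    vy  = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def vif_two_predictors(x1, x2):
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Illustrative data: x2 is nearly a rescaled copy of x1 -> high VIF.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9, 12.2]
v = vif_two_predictors(x1, x2)
tolerance = 1 / v
# v is far above 10 and tolerance far below 0.1, so one of the two
# predictors should be dropped or the pair combined
```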
40
How can you handle high VIF values?
Remove the variable with the highest VIF; this usually improves the model.
41
Which Stata command checks for multicollinearity using Variance Inflation Factors?
vif
42
What does the VIF value tell you?
How much variance of a regression coefficient is inflated due to multicollinearity.
43
What VIF values suggest a problem?
* VIF > 10 → serious multicollinearity. * Tolerance (1/VIF) < 0.1 → problematic variable.
44
What action can you take if VIF is high?
Remove or combine correlated predictors, or collect more data.
45
Which Stata command displays pairwise correlations between predictors?
pwcorr varlist, sig
46
What does the sig option do?
Displays p-values for each correlation coefficient.
47
What correlation values indicate potential multicollinearity?
When |r| > 0.8 between two predictors.
48
Why use pairwise correlations before regression?
To identify highly correlated variables that may distort coefficient estimates.
49
Which Stata command produces a visual matrix of relationships among variables?
graph matrix varlist, half
50
What does the scatterplot matrix show?
Pairwise scatterplots among variables to visualize linear relationships or clusters.
51
What does the half option do?
Displays only half of the matrix for a cleaner plot.
52
How can this plot help with multicollinearity?
It reveals strong linear patterns between predictors, confirming correlation seen in numerical checks.
53
Which three Stata tools are used together to assess multicollinearity?
* **vif** – numerical check (Variance Inflation Factor).
* **pwcorr varlist, sig** – correlation matrix.
* **graph matrix varlist, half** – visual inspection of relationships.
54
Does violating an assumption always make the regression invalid? | MLR
No. Minor or moderate violations are not necessarily fatal—the analysis can still be valid, especially with a large sample
55
Possible fixes if assumptions are violated? | MLR
1. Check for **data entry errors** or **extreme outliers**.
2. Try transforming variables (e.g., log, square-root).
3. **Recode** continuous predictors into categories if appropriate.
4. Use **bootstrapping** or robust standard errors.
5. Remove or combine collinear predictors.
56
What does “results may be robust to violation of assumptions” mean? | MLR
With a reasonably large sample size, regression estimates and p-values remain reliable even if assumptions aren’t perfectly met
57
When are assumption violations more serious? | MLR
When sample size is small or violation is severe (e.g., extreme outliers, heavy skewness, strong heteroscedasticity, or multicollinearity).
58
What are the possible effects of serious assumption violations? | MLR
* Biased or inefficient estimates. * Incorrect p-values and confidence intervals. * Misleading interpretation of predictor effects.
59
What does “transforming variables” mean? | MLR
Applying mathematical change (e.g., log Y, √Y) to stabilize variance, reduce skewness, or make relationship more linear.
60
How does bootstrapping help when assumptions are violated? | MLR
It re-samples the data many times to calculate robust confidence intervals that don’t rely on normality.
61
How does removing collinear predictors help? | MLR
It reduces redundancy among predictors, improving model stability and interpretability.
62
What is the main takeaway when assumptions are slightly violated?
Don’t discard the model: check the impact, make small adjustments if needed, and proceed cautiously.
63
What kind of predictors can be used in multiple linear regression?
Both numeric and categorical variables.
64
How are binary categorical variables coded in regression? | MLR
* 0 = unexposed (reference group) * 1 = exposed (comparison group).
65
What does the regression coefficient (β) for a binary variable represent? | MLR
Mean difference in DV between exposed (1) and unexposed (0) groups, controlling for other predictors.
66
Example: If β for smoker = 5.8, interpret it. | MLR
Smokers have, on average, 5.8 units higher Y (e.g., systolic BP) than non-smokers, adjusting for other variables.
67
What is the constant? | MLR
The mean value of the Y variable in the unexposed/baseline group (i.e., when all predictors equal 0).
68
How are categorical variables with more than two levels handled in Stata?
* By creating dummy (indicator) variables for each category, or * by using the **i.** prefix to tell Stata to treat the variable as categorical.
69
What does the i. prefix in Stata do?
Automatically creates dummy variables and selects one category as the baseline (reference).
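What the i. prefix does can be mimicked in a few lines of Python: expand a k-level variable into k − 1 indicator columns, treating the lowest code as the baseline (the smoker coding below is hypothetical):

```python
# Sketch of dummy (indicator) coding, as Stata's i. prefix does internally:
# a categorical variable with k levels becomes k-1 zero/one columns,
# with the lowest-coded level as the reference group.

def dummy_code(values, baseline=None):
    """Return (non_baseline_levels, rows of 0/1 indicators)."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]               # lowest code = reference group
    non_base = [lv for lv in levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in non_base] for v in values]
    return non_base, rows

# e.g. smoking status coded 0 = never, 1 = former, 2 = current (hypothetical)
smoker = [0, 2, 1, 0, 2, 2]
cols, dummies = dummy_code(smoker)
# cols lists the non-baseline categories: one beta is estimated per column
```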
70
How are variables with more than two categories typically coded?
Numerically coded, e.g. 0, 1, 2, 3 or 1, 2, 3, 4, representing each category.
71
What does the lowest coded value usually represent?
Baseline (reference) category — often unexposed group.
72
What are the two main ways to enter multi-category variables into a regression model?
1️⃣ As a dummy numeric variable 2️⃣ As a categorical variable
73
When a variable is entered as a dummy numeric variable, what does its β coefficient represent? | MLR
Mean difference (increase or decrease) in DV (Y) for each 1-category increase in that predictor.
74
When a variable is entered as a categorical variable, what does its β coefficient represent? | MLR
Mean difference (increase or decrease) in Y for each non-baseline group compared to the baseline group.
75
Which variables to include in clinical trials?
Pre-specified variables only
76
Which variables to include in exploratory analysis?
Try multiple possibilities, including transformed ones.
77
Which two criteria are used for comparing models?
* AIC (Akaike Information Criterion) * BIC (Bayesian Information Criterion).
78
What indicates a better model?
Lower AIC and/or BIC.
79
Which Stata commands compare models by AIC/BIC?
fitstat, saving(m1) followed by fitstat, using(m1)
80
What is the purpose of model comparison in multiple linear regression?
Decide which regression model fits the data best by comparing their overall performance and goodness-of-fit statistics.
81
Why might we compare different regression models?
Because adding or removing predictors can change model accuracy, interpretability, and overfitting risk.
82
What is meant by “nested models”?
Models where one (the simpler) is a subset of the other — the larger model contains all variables of the smaller model plus additional predictors.
83
What does AIC measure?
Relative quality of a model — it balances model fit and complexity (number of predictors).
84
What does BIC measure?
Similar to AIC but applies a stronger penalty for extra predictors, favoring more parsimonious (simpler) models.
85
What does “parsimonious” mean?
Describes a model that is simple and uses the minimum number of variables necessary to achieve a good fit to the data.
86
How are AIC and BIC interpreted?
Smaller AIC or BIC values indicate a better-fitting model among those compared.
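A hedged sketch of this "smaller is better" rule, using one common RSS-based formulation of AIC/BIC for a Gaussian linear model (the numbers are illustrative, and the formulas omit an additive constant that cancels when comparing models on the same data):

```python
# AIC = n*ln(RSS/n) + 2k,  BIC = n*ln(RSS/n) + k*ln(n)
# where n = sample size, k = number of estimated parameters.
import math

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    return n * math.log(rss / n) + k * math.log(n)

n = 100
# Suppose adding a predictor barely reduces the residual sum of squares:
rss_small, k_small = 250.0, 3   # intercept + 2 predictors
rss_large, k_large = 248.0, 4   # one extra predictor

better_small = aic(rss_small, n, k_small) < aic(rss_large, n, k_large)
# The smaller model wins: the tiny drop in RSS does not justify the
# extra parameter, so the lower-AIC (and lower-BIC) model is preferred.
```

Note how BIC's ln(n) penalty (≈ 4.6 here) punishes the extra parameter even harder than AIC's flat 2, which is why BIC favours more parsimonious models.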
87
What happens if adding a variable increases AIC/BIC?
New model fits worse once complexity is considered — the variable likely doesn’t improve model performance.
88
What happens if adding a variable decreases AIC/BIC?
New variable improves the model’s overall fit and predictive quality.
89
What Stata command displays fit statistics for a regression model?
fitstat
90
How do you save model fit statistics for later comparison?
fitstat, saving(m1)
91
How do you compare a second model to the saved one?
fitstat, using(m1)
92
What information does fitstat provide?
* R²
* Adjusted R²
* AIC
* BIC
* Log-likelihood
* Other model-fit measures useful for comparison
93
What indicates the preferred model in Stata’s fitstat output?
The model with the lower AIC and BIC values.
94
Why is it important not to rely on R² alone for comparing models?
Because R² always increases when you add predictors — it doesn’t account for overfitting; AIC/BIC adjust for this.
95
Example of nested models?
* Model 1: regress sysbp_before weight height i.smoker
* Model 2: regress sysbp_before sex age weight height i.smoker
* Here, Model 1 is nested within Model 2, because all of its variables also appear in Model 2.
96
Why are nested models compared?
To see whether adding extra variables (like age and sex) significantly improves model fit — using criteria like AIC, BIC, or F-tests.
97
What does "regress" command do?
Runs larger regression model that includes all predictors.
98
What does this command do? **"fitstat, saving(m1)"**
Displays model’s fit statistics (AIC, BIC, R², etc.) & saves them under the name m1 for later comparison.
99
What does "regress..." after fitstat, saving(m1) do?
Runs smaller model — identical to the first but with one predictor removed (e.g., drop “age”).
100
What does the **"fitstat, using(m1)"** command do?
Compares new smaller model to previous (saved) larger model, showing differences in fit statistics (AIC, BIC, etc.).
101
How do you compare nested regression models in Stata?
1️⃣ Run the full model with all predictors → regress ...
2️⃣ Save its fit stats → fitstat, saving(m1)
3️⃣ Run the smaller model (remove one predictor) → regress ...
4️⃣ Compare to the full model → fitstat, using(m1)
➡ The model with the lower AIC/BIC is preferred — simpler and better fitting.
102
Does the dependent variable (Y) need to be normally distributed in multiple linear regression?
❌ No — the dependent variable itself does not need to be normally distributed.
✅ What does need to be approximately normal are the **residuals** (errors) of the model.
103
What are the four main model building procedures?
1. Forward selection
2. Backward selection
3. Stepwise selection
4. All-subset (best-subset) selection
104
How does forward selection work?
Start with no variables → add predictors one at a time → keep those that significantly improve model fit.
105
How does backward selection work?
Start with all variables → remove the least significant one each time → stop when all remaining variables are significant.
106
What is stepwise selection?
A mix of forward and backward methods — adds variables step by step and removes any that become non-significant after additions.
107
What does all-subset selection do?
Fits every possible combination of predictors and picks the model with the best fit (lowest AIC/BIC, highest adjusted R²).
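The forward-selection procedure above can be sketched end-to-end in stdlib Python, scoring each candidate model by AIC (toy data and an RSS-based AIC; a real analysis would use Stata or a statistics package):

```python
# Forward selection: start with no predictors, each round add the candidate
# that lowers AIC the most, stop when no addition improves the model.
import math

def fit_rss(cols, y):
    """Residual sum of squares of an intercept-plus-columns least-squares fit."""
    rows = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(rows[0])
    # normal equations X'X beta = X'y, solved by Gaussian elimination
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, p):
            f = M[r][c] / M[c][c]
            for j in range(c, p + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (M[r][p] - sum(M[r][j] * beta[j] for j in range(r + 1, p))) / M[r][r]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def aic(rss, n, k):
    return n * math.log(max(rss, 1e-12) / n) + 2 * k   # floor guards log(0)

def forward_select(predictors, y):
    chosen, remaining, n = [], list(predictors), len(y)
    best = aic(fit_rss([], y), n, 1)                   # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        for cand in sorted(remaining):
            cols = [predictors[name] for name in chosen + [cand]]
            score = aic(fit_rss(cols, y), n, len(cols) + 1)
            if score < best:
                best, pick, improved = score, cand, True
        if improved:
            chosen.append(pick)
            remaining.remove(pick)
    return chosen

# Toy data: y depends on x1 and x2 only; "junk" is an irrelevant predictor.
x1   = [1, 2, 3, 4, 5, 6, 7, 8]
x2   = [3, 1, 4, 1, 5, 9, 2, 6]
junk = [2, 7, 1, 8, 2, 8, 1, 8]
y = [2 * a + b for a, b in zip(x1, x2)]
chosen = forward_select({"x1": x1, "x2": x2, "junk": junk}, y)
# x1 and x2 enter the model; junk never improves AIC, so selection stops
```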