Section 11: Multiple Linear Regression Analysis Flashcards

(107 cards)

1
Q

What does multiple linear regression model?

A

Models relationship between numeric DV & several IV (numeric or categorical).

2
Q

What is another name for multiple linear regression?

A

Multivariable linear regression (often loosely called "multivariate" regression, though strictly that term refers to models with multiple dependent variables).

3
Q

What is the multiple linear regression equation?

A

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

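The deck's commands are Stata, but the arithmetic behind this equation can be sketched in plain Python: ordinary least squares via the normal equations. Everything below (the data and variable names) is invented for illustration.

```python
# Minimal sketch (not the course's Stata workflow): fit Y = b0 + b1*X1 + b2*X2
# by ordinary least squares using only the standard library.

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    rows = [[1.0] + list(x) for x in X]   # prepend an intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# Toy data generated from Y = 2 + 3*X1 - 1*X2 with no noise,
# so OLS should recover the coefficients exactly.
X = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
b0, b1, b2 = ols(X, y)
```

Because the toy outcome is an exact linear function of the predictors, the fitted β₀, β₁, β₂ come back as 2, 3, and −1.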
4
Q

What does the fitted line represent in multiple regression?

A

The best-fit plane (or hyperplane) through a multi-dimensional data space (a multi-dimensional scatter plot).

5
Q

What can the independent variables in a multiple regression model be?

A

Numeric or categorical.

6
Q

What type of dependent variable does MLR require?

multiple linear regression

A

Numeric (continuous) dependent variable.

7
Q

How many beta coefficients (β) are estimated in multiple linear regression?

A

One for each predictor variable.

8
Q

What does each beta coefficient represent?

A

Estimated change (increase/decrease) in the DV (Y) for each 1-unit increase in the predictor variable,
* assuming all other predictors are held constant
* i.e., controlling/adjusting for the effects of the other predictor variables.

9
Q

What does “controlling for the effects of other predictors” mean?

A

Holding all other variables constant while estimating the unique effect of one predictor on the outcome.

10
Q

What does β₀ represent in the MLR equation?

A

Intercept: the predicted value of Y when all X variables are equal to 0.

11
Q

What does Y represent in the MLR equation?

A

Dependent (outcome) variable - numeric variable being predicted or explained.

12
Q

What do the coefficients β₁, β₂, …, βₚ represent in the MLR equation?

linear reg

A

Regression coefficients: the estimated change in Y for a 1-unit increase in each X, controlling for all other variables.

13
Q

What does the term ε represent?

A
  • Error term (residual) - difference between observed & predicted values of 𝑌.
  • random error
14
Q

What is the practical meaning of a positive β coefficient?

A

As predictor increases, DV increases, assuming other predictors remain constant.

15
Q

What is the meaning of a negative β coefficient?

A

As the exposure (predictor) increases, the DV decreases, holding other predictors constant.

16
Q

What does it mean when a β coefficient is 0 or not statistically significant?

A

No evidence of an association between that predictor and the outcome after adjusting for the others.

17
Q

What is the form of the simple linear regression equation?

A

Y = β₀ + β₁X + ε

18
Q

How does multiple regression extend the simple linear regression equation?
Y = β₀ + β₁X + ε

A

By adding more predictors (β₂X₂, β₃X₃, etc.) to account for multiple influences on 𝑌

19
Q

What is the Stata command for a multiple linear regression?

A

regress dependent independent1 independent2 independent3

20
Q

In a model with categorical predictors, how are dummy variables handled in Stata?

A

Prefix the variable with i. (e.g., i.smoker); this adds one β for each non-baseline category.

21
Q

Example: Interpret β = 1.07 for weight (p < 0.001).

A

For every 1-kg increase in weight, systolic BP increases on average by 1.07 units, adjusting for height, age, sex, and smoking.

22
Q

Example: Interpret β = −0.89 for height (p < 0.001).

A

For every 1-cm increase in height, systolic BP decreases by 0.89 units, controlling for other variables.

23
Q

What are the two main uses of multiple regression?

A
  1. Measure strength of effect that IV has on DV (controlling for confounding).
  2. Forecast effects/impacts of changes in exposure variables on outcome variable
24
Q

Confounding vs mediator vs effect modifier?

A
  • Confounder - associated with both the exposure and the outcome; distorts the exposure-outcome association (should be adjusted for).
  • Mediator - lies on the causal pathway between exposure and outcome (adjusting for it removes part of the true effect).
  • Effect modifier - the exposure-outcome association differs in strength or direction across its levels (report stratified estimates rather than simply adjusting).
25
What does R-squared (R²) represent?
% of variation in DV explained by the predictors
26
What happens when more variables are added?
R² increases, but overfitting may occur
27
What happens when we add an IV to an MLR model? | multiple linear regression
Increases the amount of explained variance in the DV
28
What is an overfit model?
A model that fits random noise because it includes too many predictors without justification.
29
When we look at a table for LR, which R2 value do we look at?
* Simple regression model → use regular R²
* Multiple regression → use adjusted R²
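The regular vs adjusted R² distinction on this card can be made concrete with a small stdlib-Python sketch (the observed and fitted values below are made up):

```python
# Sketch of R-squared vs adjusted R-squared (illustrative numbers only).

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS: the share of variation in Y explained by the model."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - rss / tss

def adjusted_r_squared(y, y_hat, p):
    """Penalise R^2 for the number of predictors p (n = sample size)."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y     = [10, 12, 15, 18, 20, 23]
y_hat = [11, 12, 14, 18, 21, 22]   # fitted values from some 2-predictor model
r2  = r_squared(y, y_hat)
ar2 = adjusted_r_squared(y, y_hat, p=2)
# adjusted R^2 is always <= R^2 once p >= 1, which is why it is the
# fairer statistic to report for a multiple regression model
```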
30
List the 5 key assumptions for MLR
1. **Linearity** between DV (Y) & each **numeric** IV (X).
2. No significant **outliers**.
3. **Normality** of numeric variables & residuals.
4. **Homoscedasticity** of residuals.
5. No **multicollinearity** between predictor variables.
31
How is linearity tested for MLR? | Assumption 1
* Scatter plots * Residual plots
32
How do we check that there are no significant outliers for MLR? | Assumption 2
* Linear regression is sensitive to outlier effects, so significant outliers must be excluded * Tested using residual plots
33
How is normality of numeric variables & of residuals tested for MLR? | Assumption 3
Normal probability plots
34
How is no multicollinearity between predictor variables tested for MLR? | Assumption 5
* Variance inflation factors (VIF) * Pairwise correlations
35
How can we test homoscedasticity of residuals for MLR? | Assumption 4
* Residuals should have equal variance across the regression line * Tested using residual plots or the **hettest** command
36
When does multicollinearity occur?
When 2 or more IVs are highly correlated with each other.
37
What statistic detects multicollinearity?
Variance Inflation Factor (VIF).
38
When is VIF too high?
VIF > 10 suggests multicollinearity.
39
What is tolerance and its rule?
* Tolerance = 1/VIF. * If < 0.1 → problematic collinearity.
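As a sanity check on the VIF and tolerance rules, here is a minimal Python sketch for the special case of exactly two predictors, where the auxiliary R² is simply the squared pairwise correlation (the data are invented; with more predictors, R²ⱼ comes from regressing Xⱼ on all the others):

```python
# For a model with exactly two predictors, VIF = 1 / (1 - r^2),
# where r is the Pearson correlation between the predictors.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx  = sum((a - mx) ** 2 for a in x)
    vy  = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def vif_two_predictors(x1, x2):
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Illustrative data: x2 is nearly a rescaled copy of x1 -> high VIF.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9, 12.2]
v = vif_two_predictors(x1, x2)
tolerance = 1 / v
# v is far above 10 and tolerance far below 0.1, so one of the two
# predictors should be dropped or the pair combined
```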
40
How can you handle high VIF values?
Remove the variable with the highest VIF; this usually improves the model.
41
Which Stata command checks for multicollinearity using Variance Inflation Factors?
vif
42
What does the VIF value tell you?
How much variance of a regression coefficient is inflated due to multicollinearity.
43
What VIF values suggest a problem?
* VIF > 10 → serious multicollinearity. * Tolerance (1/VIF) < 0.1 → problematic variable.
44
What action can you take if VIF is high?
Remove or combine correlated predictors, or collect more data.
45
Which Stata command displays pairwise correlations between predictors?
pwcorr varlist, sig
46
What does the sig option do?
Displays p-values for each correlation coefficient.
47
What correlation values indicate potential multicollinearity?
When |r| > 0.8 between two predictors.
48
Why use pairwise correlations before regression?
To identify highly correlated variables that may distort coefficient estimates.
49
Which Stata command produces a visual matrix of relationships among variables?
graph matrix varlist, half
50
What does the scatterplot matrix show?
Pairwise scatterplots among variables to visualize linear relationships or clusters.
51
What does the half option do?
Displays only half of the matrix for a cleaner plot.
52
How can this plot help with multicollinearity?
It reveals strong linear patterns between predictors, confirming correlation seen in numerical checks.
53
Which three Stata tools are used together to assess multicollinearity?
* **vif** – numerical check (Variance Inflation Factor).
* **pwcorr varlist, sig** – correlation matrix.
* **graph matrix varlist, half** – visual inspection of relationships.
54
Does violating an assumption always make the regression invalid? | MLR
No. Minor or moderate violations are not necessarily fatal—the analysis can still be valid, especially with a large sample
55
Possible fixes if assumptions are violated? | MLR
1. Check for **data entry errors** or **extreme outliers**.
2. Try transforming variables (e.g., log, square-root).
3. **Recode** continuous predictors into categories if appropriate.
4. Use **bootstrapping** or robust standard errors.
5. Remove or combine collinear predictors.
56
What does “results may be robust to violation of assumptions” mean? | MLR
With a reasonably large sample size, regression estimates and p-values remain reliable even if assumptions aren’t perfectly met
57
When are assumption violations more serious? | MLR
When sample size is small or violation is severe (e.g., extreme outliers, heavy skewness, strong heteroscedasticity, or multicollinearity).
58
What are the possible effects of serious assumption violations? | MLR
* Biased or inefficient estimates. * Incorrect p-values and confidence intervals. * Misleading interpretation of predictor effects.
59
What does “transforming variables” mean? | MLR
Applying mathematical change (e.g., log Y, √Y) to stabilize variance, reduce skewness, or make relationship more linear.
60
How does bootstrapping help when assumptions are violated? | MLR
It re-samples the data many times to calculate robust confidence intervals that don’t rely on normality.
61
How does removing collinear predictors help? | MLR
It reduces redundancy among predictors, improving model stability and interpretability.
62
What is the main takeaway when assumptions are slightly violated?
Don’t discard the model: check the impact, make small adjustments if needed, and proceed cautiously.
63
What kind of predictors can be used in multiple linear regression?
Both numeric and categorical variables.
64
How are binary categorical variables coded in regression? | MLR
* 0 = unexposed (reference group) * 1 = exposed (comparison group).
65
What does the regression coefficient (β) for a binary variable represent? | MLR
Mean difference in DV between exposed (1) and unexposed (0) groups, controlling for other predictors.
66
Example: If β for smoker = 5.8, interpret it. | MLR
Smokers have, on average, 5.8 units higher Y (e.g., systolic BP) than non-smokers, adjusting for other variables.
67
What is the constant? | MLR
The mean value of the Y variable in the unexposed/baseline group (i.e., when all predictors equal 0).
68
How are categorical variables with more than two levels handled in Stata?
* By creating dummy (indicator) variables for each category, or * by using the **i.** prefix to tell Stata to treat the variable as categorical.
69
What does the i. prefix in Stata do?
Automatically creates dummy variables and selects one category as the baseline (reference).
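What the i. prefix does can be mimicked in a few lines of Python: expand a k-level variable into k − 1 indicator columns, treating the lowest code as the baseline (the smoker coding below is hypothetical):

```python
# Sketch of dummy (indicator) coding, as Stata's i. prefix does internally:
# a categorical variable with k levels becomes k-1 zero/one columns,
# with the lowest-coded level as the reference group.

def dummy_code(values, baseline=None):
    """Return (non_baseline_levels, rows of 0/1 indicators)."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]               # lowest code = reference group
    non_base = [lv for lv in levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in non_base] for v in values]
    return non_base, rows

# e.g. smoking status coded 0 = never, 1 = former, 2 = current (hypothetical)
smoker = [0, 2, 1, 0, 2, 2]
cols, dummies = dummy_code(smoker)
# cols lists the non-baseline categories: one beta is estimated per column
```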
70
How are variables with more than two categories typically coded?
Numerically coded, e.g. 0, 1, 2, 3 or 1, 2, 3, 4, representing each category.
71
What does the lowest coded value usually represent?
Baseline (reference) category — often unexposed group.
72
What are the two main ways to enter multi-category variables into a regression model?
1️⃣ As a dummy numeric variable 2️⃣ As a categorical variable
73
When a variable is entered as a dummy numeric variable, what does its β coefficient represent? | MLR
Mean difference (increase or decrease) in DV (Y) for each 1-category increase in that predictor.
74
When a variable is entered as a categorical variable, what does its β coefficient represent? | MLR
Mean difference (increase or decrease) in Y for each non-baseline group compared to the baseline group.
75
Which variables to include in clinical trials?
Pre-specified variables only
76
Which variables to include in exploratory analysis?
Try multiple possibilities, including transformed ones.
77
Which two criteria are used for comparing models?
* AIC (Akaike Information Criterion) * BIC (Bayesian Information Criterion).
78
What indicates a better model?
Lower AIC and/or BIC.
79
Which Stata commands compare models by AIC/BIC?
fitstat, saving(m1) followed by fitstat, using(m1)
80
What is the purpose of model comparison in multiple linear regression?
Decide which regression model fits the data best by comparing their overall performance and goodness-of-fit statistics.
81
Why might we compare different regression models?
Because adding or removing predictors can change model accuracy, interpretability, and overfitting risk.
82
What is meant by “nested models”?
Models where one (the simpler) is a subset of the other — the larger model contains all variables of the smaller model plus additional predictors.
83
What does AIC measure?
Relative quality of a model — it balances model fit and complexity (number of predictors).
84
What does BIC measure?
Similar to AIC but applies a stronger penalty for extra predictors, favoring more parsimonious (simpler) models.
85
What does “parsimonious” mean?
Describes a model that is simple and uses the minimum number of variables necessary to achieve a good fit to the data.
86
How are AIC and BIC interpreted?
Smaller AIC or BIC values indicate a better-fitting model among those compared.
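A hedged sketch of this "smaller is better" rule, using one common RSS-based formulation of AIC/BIC for a Gaussian linear model (the numbers are illustrative, and the formulas omit an additive constant that cancels when comparing models on the same data):

```python
# AIC = n*ln(RSS/n) + 2k,  BIC = n*ln(RSS/n) + k*ln(n)
# where n = sample size, k = number of estimated parameters.
import math

def aic(rss, n, k):
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    return n * math.log(rss / n) + k * math.log(n)

n = 100
# Suppose adding a predictor barely reduces the residual sum of squares:
rss_small, k_small = 250.0, 3   # intercept + 2 predictors
rss_large, k_large = 248.0, 4   # one extra predictor

better_small = aic(rss_small, n, k_small) < aic(rss_large, n, k_large)
# The smaller model wins: the tiny drop in RSS does not justify the
# extra parameter, so the lower-AIC (and lower-BIC) model is preferred.
```

Note how BIC's ln(n) penalty (≈ 4.6 here) punishes the extra parameter even harder than AIC's flat 2, which is why BIC favours more parsimonious models.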
87
What happens if adding a variable increases AIC/BIC?
New model fits worse once complexity is considered — the variable likely doesn’t improve model performance.
88
What happens if adding a variable decreases AIC/BIC?
New variable improves the model’s overall fit and predictive quality.
89
What Stata command displays fit statistics for a regression model?
fitstat
90
How do you save model fit statistics for later comparison?
fitstat, saving(m1)
91
How do you compare a second model to the saved one?
fitstat, using(m1)
92
What information does fitstat provide?
* R²
* Adjusted R²
* AIC
* BIC
* Log-likelihood
* Other model-fit measures useful for comparison
93
What indicates the preferred model in Stata’s fitstat output?
The model with the lower AIC and BIC values.
94
Why is it important not to rely on R² alone for comparing models?
Because R² always increases when you add predictors — it doesn’t account for overfitting; AIC/BIC adjust for this.
95
Example of nested models?
* Model 1: regress sysbp_before weight height i.smoker
* Model 2: regress sysbp_before sex age weight height i.smoker
* Here, Model 1 is nested within Model 2, because all of its variables also appear in Model 2.
96
Why are nested models compared?
To see whether adding extra variables (like age and sex) significantly improves model fit — using criteria like AIC, BIC, or F-tests.
97
What does "regress" command do?
Runs larger regression model that includes all predictors.
98
What does this command do? **"fitstat, saving(m1)"**
Displays model’s fit statistics (AIC, BIC, R², etc.) & saves them under the name m1 for later comparison.
99
What does "regress..." after fitstat, saving(m1) do?
Runs smaller model — identical to the first but with one predictor removed (e.g., drop “age”).
100
What does the **"fitstat, using(m1)"** command do?
Compares new smaller model to previous (saved) larger model, showing differences in fit statistics (AIC, BIC, etc.).
101
How do you compare nested regression models in Stata?
1️⃣ Run the full model with all predictors → regress ...
2️⃣ Save its fit stats → fitstat, saving(m1)
3️⃣ Run the smaller model (remove one predictor) → regress ...
4️⃣ Compare to the full model → fitstat, using(m1)
➡ The model with the lower AIC/BIC is preferred — simpler and better fitting.
102
Does the dependent variable (Y) need to be normally distributed in multiple linear regression?
❌ No — the dependent variable itself does not need to be normally distributed.
✅ What does need to be approximately normal are the **residuals** (errors) of the model.
103
What are the four main model building procedures?
1. Forward selection
2. Backward selection
3. Stepwise selection
4. All-subset (best-subset) selection
104
How does forward selection work?
Start with no variables → add predictors one at a time → keep those that significantly improve model fit.
105
How does backward selection work?
Start with all variables → remove the least significant one each time → stop when all remaining variables are significant.
106
What is stepwise selection?
A mix of forward and backward methods — adds variables step by step and removes any that become non-significant after additions.
107
What does all-subset selection do?
Fits every possible combination of predictors and picks the model with the best fit (lowest AIC/BIC, highest adjusted R²).
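The forward-selection procedure above can be sketched end-to-end in stdlib Python, scoring each candidate model by AIC (toy data and an RSS-based AIC; a real analysis would use Stata or a statistics package):

```python
# Forward selection: start with no predictors, each round add the candidate
# that lowers AIC the most, stop when no addition improves the model.
import math

def fit_rss(cols, y):
    """Residual sum of squares of an intercept-plus-columns least-squares fit."""
    rows = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(rows[0])
    # normal equations X'X beta = X'y, solved by Gaussian elimination
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, p):
            f = M[r][c] / M[c][c]
            for j in range(c, p + 1):
                M[r][j] -= f * M[c][j]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (M[r][p] - sum(M[r][j] * beta[j] for j in range(r + 1, p))) / M[r][r]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def aic(rss, n, k):
    return n * math.log(max(rss, 1e-12) / n) + 2 * k   # floor guards log(0)

def forward_select(predictors, y):
    chosen, remaining, n = [], list(predictors), len(y)
    best = aic(fit_rss([], y), n, 1)                   # intercept-only model
    improved = True
    while improved and remaining:
        improved = False
        for cand in sorted(remaining):
            cols = [predictors[name] for name in chosen + [cand]]
            score = aic(fit_rss(cols, y), n, len(cols) + 1)
            if score < best:
                best, pick, improved = score, cand, True
        if improved:
            chosen.append(pick)
            remaining.remove(pick)
    return chosen

# Toy data: y depends on x1 and x2 only; "junk" is an irrelevant predictor.
x1   = [1, 2, 3, 4, 5, 6, 7, 8]
x2   = [3, 1, 4, 1, 5, 9, 2, 6]
junk = [2, 7, 1, 8, 2, 8, 1, 8]
y = [2 * a + b for a, b in zip(x1, x2)]
chosen = forward_select({"x1": x1, "x2": x2, "junk": junk}, y)
# x1 and x2 enter the model; junk never improves AIC, so selection stops
```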