Section 10: Linear Regression Analysis Flashcards

(140 cards)

1
Q

What is regression modelling?

A

statistical technique used to study the relationship between variables

2
Q

Why do we use regression models?

A

Helps understand how DV (outcome) changes when one or more IV (predictors) change

3
Q

What is the purpose of regression analysis?

A
  1. Explanation – understanding relationships between variables.
  2. Prediction – forecasting future outcomes.
4
Q

What types of outcomes can regression modelling handle?

A
  • Continuous outcomes → Linear regression
  • Binary outcomes → Logistic regression
  • Categorical/Count/Time-based outcomes → Other regression models (e.g., Poisson, Cox).
5
Q

What is the main goal of regression analysis?

A

To find best-fitting mathematical equation that describes the relationship between variables.

6
Q

What does a simple linear regression analysis model?

A

Models the linear relationship between:

  • A normally distributed numeric outcome
  • A numeric exposure
7
Q

What are the returns of a simple linear regression analysis?

A
  • the regression equation: y = β₀ + β₁x
  • Confidence interval
  • significance (p-value)
8
Q

What does a simple linear regression analysis find?

A

line that best fits the pattern of the linear relationship

9
Q

What is Beta 0?

A
  • Predicted Y when X = 0 (may or may not be meaningful depending on context).
  • it is the y-intercept of the regression line
10
Q

What is β₁ (the regression coefficient)?

A

Estimated change (increase or decrease) in the Y variable for each 1 unit increase in the X variable

11
Q

Example: if β₁ = 3, what happens?

A

Y increases by 3 units for every 1-unit increase in X (on average)

12
Q

What is the regression equation?

A

Y = β₀ + β₁X + ε

13
Q

What is the residual error (ε)?

A

Distance between an actual data point & the predicted value from the fitted line (used when checking assumptions)

14
Q

What does a steeper slope mean?

A

a stronger effect: a larger change in Y for each 1-unit change in X

15
Q

What is R2?

A

Proportion of variation in Y explained by X — a measure of model fit.

16
Q

How does correlation differ from regression?

A
  • Correlation measures strength & direction of a linear relationship between two variables, without distinguishing cause and effect.
  • Regression models & predicts relationship, specifying DV (Y) and one or more IV (X).
17
Q

What type of outcome is required for simple linear regression?

A

A normally distributed numeric outcome (continuous variable).

18
Q

What are residuals in regression analysis?

A

Differences between observed values & predicted values from regression line — they measure model’s error.

19
Q

What does a typical regression output include?

A
  • Regression coefficients
  • standard errors
  • confidence intervals
  • p-values
  • measures of model fit such as R².
20
Q

Example: BP = 105.24 + 0.8 × BMI. What is the predicted BP for BMI = 26?

A

105.24 + 0.8 × 26 = 126.04

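The prediction in card 20 is simple arithmetic; a minimal Python check, with the coefficients taken from the card's example equation:

```python
# Predicted BP from the example equation BP = 105.24 + 0.8 * BMI (card 20).
b0, b1 = 105.24, 0.8
bmi = 26
predicted_bp = round(b0 + b1 * bmi, 2)
print(predicted_bp)  # 126.04
```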
21
Q

What is R-squared (R²) in linear regression?

A
  • The coefficient of determination
  • proportion of variance in the DV explained by the model.
22
Q

What does an R² value close to 1 indicate?

A
  • Model explains almost all variability in the response variable — an excellent fit.
  • the closer R² is to 1.00, the better the linear regression model fits
23
Q

What does an R² value of 0 indicate?

A

That the model explains none of the variation in the dependent variable.

24
Q

What is the relationship between association & R2?

A

The stronger the association between outcome & exposure, the higher the R²

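Cards 21–24 describe R² in words; a pure-Python sketch (data invented for illustration) computes the slope, intercept, and R² directly from their definitions:

```python
# Fit y = b0 + b1*x by least squares on invented data, then compute R^2.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: covariance of x and y divided by variance of x.
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx  # intercept: the fitted line passes through (mean x, mean y)

# R^2 = 1 - residual sum of squares / total sum of squares.
pred = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
```

Because the invented data are almost perfectly linear, R² comes out close to 1, matching card 22's interpretation.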
25
What are the four main assumptions of linear regression?
1. Linear relationship between IV & DV 2. No significant outliers 3. Normality of residuals (and ideally of the DV) 4. Homoscedasticity of residuals
26
What happens if the relationship between 2 numeric variables is not linear?
Linear regression analysis will provide inaccurate regression coefficients & results
27
How can you test for linearity in regression? (assumption 1)
* using scatter plots * residual-vs-fitted value plots.
28
How are outliers detected in linear regression? (assumption 2)
Using scatterplots, residual plots, Cook’s Distance.
29
What is linear regression sensitive to?
* Outlier effects * may need to exclude them, run a sensitivity analysis, or apply transformations
30
How is the normality of residuals tested? (assumption 3)
* Q–Q plots * P–P plots * Shapiro–Wilk test
31
How is homoscedasticity tested?
* Using residuals-vs-fitted plots * or the Breusch–Pagan test (estat hettest). * scatterplots
32
What is homoscedasticity of residuals?
The residuals (distances between each point and the best-fit line) have equal spread across the regression line
33
What can you do if regression assumptions are violated?
* Recheck data and outliers * transform variables * use bootstrapping for robust inference.
34
What should you expect to see in a residuals-vs-fitted values plot if the linearity assumption holds?
Points should be randomly scattered around the horizontal zero line, without any pattern.
35
What is the purpose of the **acprplot** command in Stata?
It creates an augmented component-plus-residual plot to help identify non-linear relationships.
36
What indicates nonlinearity in an acprplot?
Systematic curves or clear deviations from the horizontal line.
37
What is Cook’s Distance used for in regression analysis?
to test the influence of outliers
38
What do we get from Stata when we use the cooksd option of predict?
table with the influence statistic & residual for each observation
39
Why are there four options for testing normality in regression?
Because Q–Q and P–P plots can each be applied to either the original variables (X or Y) or the residuals from the regression.
40
What does a Q–Q plot of the observed variable test?
Whether the original variable (e.g., X or Y) is normally distributed before fitting the model.
41
What does a Q–Q plot with residuals test?
Whether the observed residuals follow a normal distribution
42
What does a P-P plot of the observed variable test?
It compares the cumulative distribution function (CDF) of the X or Y variable against the CDF of the standard normal distribution
43
What does a P–P plot with residuals test?
It compares the cumulative distribution function (CDF) of the residuals against the CDF of the standard normal distribution
44
What are the four options for testing normality in regression?
1. Q–Q plot with observed data 2. Q–Q plot with residuals 3. P–P plot with observed data 4. P–P plot with residuals.
45
Which plot is better for detecting tail problems (skewness, kurtosis, heavy tail)?
The Q–Q plot.
46
Which plot is better for detecting central deviations from normality?
P–P plot.
47
When is Q-Q plot better to use?
for spotting tail problems (e.g. heavy tails, skewness)
48
When is P-P plot better to use?
better for seeing central deviations from normality
49
Difference between Q–Q and P–P plots?
* Q–Q detects tail problems * P–P focuses on central deviations.
50
Are small deviations from the diagonal acceptable in normal probability plots?
Yes, especially in large samples
51
Which part of the data must be normal?
The residuals, not necessarily the dependent variable itself.
52
What are the Stata commands for Q–Q and P–P plots?
* "qnorm variable" - QQ * "pnorm variable" - PP
53
Why can the Shapiro-Wilk test be misleading?
Large samples can show significant p-values for trivial deviations.
54
What would a Q–Q plot look like if residuals are right-skewed?
The points would curve below the diagonal line on the left and above the line on the right — forming an “S” shape.
55
How do you interpret the Shapiro-Wilk test?
p ≥ 0.05 → data are normal; p < 0.05 → non-normal.
56
Stata command for a formal normality test?
"**swilk variable**" — Shapiro-Wilk test
57
What common issues cause non-normal residuals?
* Outliers * skewed data * or non-linear relationships.
58
How can you fix non-normal residuals?
* Remove outliers * transform Y (e.g., log) * use bootstrapping.
59
How can homoscedasticity be assessed visually?
* By plotting residuals against fitted values * points should be more or less equidistant from a horizontal line
60
What Stata command tests for homoscedasticity statistically?
**estat hettest** (Breusch–Pagan / Cook–Weisberg test).
61
How do you interpret the Breusch–Pagan test result (hettest)?
* p ≥ 0.05 → Constant variance (assumption holds). * p < 0.05 → Heteroscedasticity (assumption violated).
62
What is homoscedasticity in linear regression?
* residuals have constant variance across all levels of the independent variable(s). * equal distance from line of best fit
63
What is heteroscedasticity?
* When **variance of residuals changes** across the fitted values * e.g., form a funnel or U-shaped pattern in a residual plot.
64
Why is homoscedasticity important?
Because unequal variance makes standard errors, t-tests, & confidence intervals unreliable, even if coefficients remain unbiased.
65
What should residuals look like if the assumption holds?
Randomly scattered around zero with a **roughly equal vertical spread** across all fitted values.
66
What pattern suggests heteroscedasticity?
funnel shape — residuals narrow at one end and widen at the other, or vice versa.
67
Which plot is most useful for checking homoscedasticity?
The residuals-vs-fitted-values plot.
68
What Stata command produces a residuals-vs-fitted plot?
* Stata: **rvfplot** * It displays residuals (Y-axis) vs. fitted values (X-axis).
69
What formal test checks for heteroscedasticity in Stata?
The Breusch–Pagan / Cook–Weisberg test, using **estat hettest** (its null hypothesis is homoscedasticity, so it tests *for* heteroscedasticity)
70
Can heteroscedasticity affect coefficient estimates?
No, coefficients stay unbiased, but inference (SEs, p-values, CIs) can be incorrect.
71
What causes heteroscedasticity?
* Outliers * skewed variables * incorrect functional form * subgroups with different variability.
72
How does a residual plot look when homoscedasticity holds?
Points form a horizontal cloud around zero, evenly spread along X.
73
How can heteroscedasticity be fixed?
1. Transform the dependent variable (e.g., log, sqrt) 2. Use robust standard errors 3. Remove or model outliers.
74
What happens if assumptions are violated for linear regression?
* not a reason to reject the whole analysis * Provided the sample size is large, results may still be robust to violations of the assumptions
75
What should you do first when regression assumptions are violated?
* **Check for data errors** & **remove influential outliers** * then repeat analysis to see if results change.
76
What can you do if the linearity or normality assumptions are violated?
Apply **data transformations** (e.g., log or square root) to improve linearity & normalize residuals.
77
What method can be used if assumptions remain violated after transformations?
Use **bootstrapping** to obtain assumption-independent confidence intervals and robust standard errors.
78
✅ What are the three main actions to take when regression assumptions are violated?
1️⃣**Check and clean data** : correct errors and remove influential outliers. 2️⃣ **Transform variables** : fix non-linearity or non-normality (e.g., log, square root). 3️⃣ **Use bootstrapping**: obtain robust, assumption-free confidence intervals.
79
What is data transformation?
transforming the data to a different scale of measurement
80
What can transformation address?
assumption violations (especially the linearity assumption, but also non-normality & heteroscedasticity)
81
When is a log transformation useful?
When data are positively skewed or variance increases with fitted values.
82
What is logarithmic transformation?
* most common * only used with positive values
83
What’s the formula for a log transformation?
* u = log₁₀(x) * u = ln(x)
84
How do you interpret log-transformed regression results?
Exponentiate coefficients: exp(β) gives multiplicative (percentage) change in Y for a one-unit change in X.
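Card 84's exp(β) rule in a quick Python sketch; the coefficient value here is invented for illustration:

```python
import math

# Suppose a model of log(Y) on X estimated beta = 0.05 (invented value).
beta = 0.05
multiplier = math.exp(beta)          # multiplicative change in Y per 1-unit increase in X
pct_change = (multiplier - 1) * 100  # approximate percentage change in Y
print(round(pct_change, 2))  # 5.13 -> about a 5.13% increase per unit of X
```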
85
What is bootstrapping?
Resampling method that takes repeated samples with replacement to estimate standard errors & confidence intervals.
86
Why use bootstrapping?
It provides reliable conclusions without relying on normality or homoscedasticity assumptions.
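A minimal sketch of the bootstrap idea from cards 85–86, with invented data: resample cases with replacement, refit the slope each time, and take percentiles of the resampled slopes as an assumption-free confidence interval:

```python
import random

random.seed(0)  # reproducible resampling
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.1, 5.2, 5.9, 8.4, 9.1, 11.0, 12.2]  # invented data

def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)

boot = []
for _ in range(2000):
    idx = [random.randrange(len(x)) for _ in range(len(x))]  # sample rows with replacement
    xs = [x[i] for i in idx]
    if len(set(xs)) > 1:  # skip degenerate resamples with no variation in x
        boot.append(slope(xs, [y[i] for i in idx]))

boot.sort()
ci = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])  # percentile 95% CI
```

The percentile interval comes straight from the sorted resampled slopes, so it needs no normality or homoscedasticity assumptions.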
87
Can categorical exposure variables be used in linear regression?
Yes, if coded numerically (e.g., 0 = reference, 1 = exposed). | for binary exposure variables (yes & no or exposed & unexposed)
88
How is a binary exposure variable interpreted?
β coefficient shows the mean difference in Y between the two groups.
89
What does the intercept represent in a binary model?
Mean value of Y for the reference (0) group.
90
How are multi-category variables handled in Stata?
Using dummy coding or **i.variable** notation.
91
How are categorical variables with more than two categories coded for regression?
Numerically coded — e.g., 0, 1, 2, 3 or 1, 2, 3, 4 — where the lowest value represents reference (unexposed) category.
92
What are the two main ways to include multi-category variables in a regression model?
1️⃣ As a numeric variable (dummy numeric entry). 2️⃣ As a categorical variable (with i. prefix in Stata).
93
What happens if a multi-category variable is entered as a numeric variable in Stata (just the variable name)?
Stata treats it as a continuous variable, & β coefficient represents mean change in Y for each one-category increase in X.
94
What happens if a multi-category variable is entered as a **categorical variable** in Stata (with i. before it)?
Each category is compared separately to the **baseline (reference) group** & each **β coefficient** represents mean difference in Y between that group & the baseline.
95
How do you interpret β coefficients when using i.variable?
Each β shows how much higher or lower the **mean Y** is for that category compared to the **reference (baseline)** category.
96
What is a dummy variable in regression?
* A **binary variable** coded as 0 or 1 that represents a specific category of a categorical variable * (e.g., 1 = exposed, 0 = not exposed).
97
Why do we use dummy variables?
To include categorical variables in regression models by converting categories into **numeric indicators** that can be interpreted statistically.
98
How is the β coefficient interpreted for a dummy variable?
It represents **mean difference** in DV (Y) between group coded as **1** and the **reference group coded as 0.**
99
Example: If smoking (1 = smoker, 0 = non-smoker) and β = 5, how is this interpreted?
On average, smokers have Y values 5 units higher than non-smokers.
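Cards 88–99 state that for a binary dummy, β₁ is the difference in group means and β₀ is the reference-group mean; a small pure-Python check with invented data confirms this:

```python
# Two groups coded 0 (reference) and 1 (exposed); invented outcome values.
group0 = [4.0, 5.0, 6.0]    # mean 5
group1 = [9.0, 10.0, 11.0]  # mean 10

x = [0] * len(group0) + [1] * len(group1)
y = group0 + group1
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Ordinary least squares slope and intercept.
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# b1 equals mean(group1) - mean(group0); b0 equals mean(group0).
```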
100
When might you use a t-test or ANOVA instead of regression?
When comparing group means and regression assumptions are hard to meet.
101
What is the first step in simple linear regression?
**Data visualization** — examine the data before modelling.
102
How do you visualize the dependent variable before regression? | Step 1 (data visualisation) of simple linear regression
Use a **histogram** to check its distribution.
103
How do you visualize the relationship between variables? | Step 1 (data visualisation) of simple linear regression
Draw a scatter plot with DV on Y-axis & IV on X-axis.
104
What Stata command creates a histogram of a variable? | for step 1 (data visualisation) of simple linear regression
**hist var_name**
105
What Stata command creates a scatter plot between two variables? | for step 1 (data visualisation) of simple linear regression
* **scatter var1 var2** * var1 = dependent variable * var2 = independent variable.
106
What is the second step in simple linear regression?
* Fit the regression model & * interpret results **(regression coefficients, 95% CI, p-values, R²).**
107
What Stata command fits a simple linear regression?
* **regress var1 var2** * var1 = dependent variable * var2 = independent variable.
108
How can you generate residuals in Stata after regression?
* **predict residuals if e(sample), resid** * Creates new variable containing the model’s residuals.
109
What does “residuals” mean in regression output?
**differences between observed & predicted values** — they show how well the model fits the data.
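Residuals as defined in cards 108–110 are just observed minus predicted values; a one-line Python illustration with invented numbers:

```python
# Residual = observed - predicted (invented values).
observed  = [10.0, 12.0, 15.0]
predicted = [ 9.5, 12.5, 14.0]
residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)  # [0.5, -0.5, 1.0]
```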
110
What does it mean if residuals are large?
Model **doesn't describe relationship well** — predictions are far from the observed values.
111
What does a low R² value indicate?
The model explains little of Y’s variation — the relationship may be weak or other predictors are missing.
112
How do you interpret an R² value of 0.80?
80% of the variability in Y is explained by the independent variable X — the model fits the data well
113
What does a negative β coefficient mean?
As X increases, Y decreases — there is a negative relationship.
114
What does a positive β coefficient mean?
As X increases, Y increases — there is a positive relationship.
115
What does the p-value indicate in regression output?
It tests whether the β coefficient is significantly different from zero (i.e., whether X is related to Y).
116
How can you check for a linear relationship in Stata?
Use the **augmented component-plus-residual plot** **(acprplot)**.
117
What Stata command checks the linearity assumption?
* acprplot var1, lowess * var1 = IV
118
What does the lowess option in acprplot do?
It adds a smoothed curve to help visualize whether the relationship is approximately linear.
119
How do you interpret an acprplot?
* **Straight line** → linear relationship holds. * **Curved line** → relationship is non-linear (assumption violated).
120
What Stata command creates a residual-vs-fitted plot to check for outliers?
* **rvfplot, mlabel(id)** * mlabel() takes the name of your own ID variable (here called id) * It plots residuals against fitted values and labels each point with that variable's value
121
What does the mlabel(id) option do in rvfplot?
It labels data points with their ID numbers so potential outliers can be identified visually; replace id with the name of your own ID variable.
122
How can you view the characteristics of a possible outlier in Stata?
* **list var1 var2 if id==10** * Lists the selected patient’s values, where var1 var2 are patient characteristics (age, weight, height, etc.) & 10 is that patient’s ID number
123
What Stata command calculates Cook’s Distance values?
* **predict cook if e(sample), cooksd** * Creates a new variable (cook) storing Cook’s Distance for each observation.
124
What does Cook’s Distance measure?
Influence of each observation on regression model - how much the fitted values would change if that case were removed.
125
What command lists cases with large Cook’s Distance values?
* **list id if cook > 4/230** * This lists IDs with Cook’s Distance > 4/n, where n is the sample size.
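The 4/n rule from card 125 in a short Python sketch; the Cook's Distance values and IDs below are invented:

```python
# Flag observations whose Cook's Distance exceeds the 4/n rule of thumb.
n = 230  # sample size from the card's example
cutoff = 4 / n  # roughly 0.0174

cooks = {1: 0.002, 10: 0.031, 57: 0.019, 103: 0.005}  # invented id -> Cook's D
flagged = [pid for pid, d in cooks.items() if d > cutoff]
print(flagged)  # [10, 57]
```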
126
What does it mean if Cook’s Distance is high?
Observation has strong influence on model & should be checked for data entry errors or special causes.
127
What should you do if you find influential outliers?
* Verify accuracy * justify inclusion/exclusion * repeat analysis to see if results change.
128
Why is normality of residuals important?
Ensure p-values, confidence intervals, & hypothesis tests from model are valid.
129
What Stata command checks normality using a Q–Q plot?
* **qnorm residuals** * Plots residuals against a normal distribution.
130
What does it mean if points in a Q–Q plot fall roughly along a straight line?
The residuals are **approximately normal** — the assumption holds.
131
What does it mean if points curve away from the diagonal in a Q–Q plot?
The residuals are **not normally distributed** — the assumption is violated.
132
What statistical test can be used to formally test residual normality in Stata?
* **swilk residuals** * This runs Shapiro–Wilk test for normality.
133
How do you interpret the Shapiro–Wilk test result?
* **p ≥ 0.05** → residuals are **normal** (assumption met) * **p < 0.05**→ residuals are **not normal** (assumption violated).
134
135
What Stata command checks for homoscedasticity visually?
* **rvfplot, yline(0)** * plots residuals vs fitted values & adds a horizontal reference line at 0 for easier interpretation.
136
What should a residuals-vs-fitted plot look like if the assumption is met?
Points evenly scattered around zero line with no pattern or change in spread.
137
What Stata command formally tests for heteroscedasticity?
**estat hettest**
138
How do you interpret the **estat hettest** result?
* **p ≥ 0.05** → No heteroscedasticity (good — **assumption met**) * **p < 0.05** → Heteroscedasticity present (**assumption violated**).
139
What happens if the assumptions are met for simple linear regression?
results of linear regression are **valid**
140
What happens if the assumptions are NOT met for simple linear regression?
* results of linear regression are **NOT valid** * consider transformations or removing outliers, then go back to step 1