Section 10: Linear Regression Analysis Flashcards

(140 cards)

1
Q

What is regression modelling?

A

statistical technique used to study the relationship between variables

2
Q

Why do we use regression models?

A

Helps understand how DV (outcome) changes when one or more IV (predictors) change

3
Q

What is the purpose of regression analysis?

A
  1. Explanation – understanding relationships between variables.
  2. Prediction – forecasting future outcomes.
4
Q

What types of outcomes can regression modelling handle?

A
  • Continuous outcomes → Linear regression
  • Binary outcomes → Logistic regression
  • Categorical/Count/Time-based outcomes → Other regression models (e.g., Poisson, Cox).
5
Q

What is the main goal of regression analysis?

A

To find best-fitting mathematical equation that describes the relationship between variables.

6
Q

What does a simple linear regression analysis model?

A

Models the linear relationship between:

  • A normally distributed numeric outcome
  • A numeric exposure
7
Q

What are the returns of a simple linear regression analysis?

A
  • the regression equation: y = β₀ + β₁x
  • Confidence interval
  • significance (p-value)
8
Q

What does a simple linear regression analysis find?

A

line that best fits the pattern of the linear relationship

9
Q

What is Beta 0?

A
  • Predicted Y when X = 0 (may or may not be meaningful depending on context).
  • it is the y-intercept of the regression line
10
Q

What is β₁ (the regression coefficient)?

A

Estimated change (increase or decrease) in the Y variable for each 1 unit increase in the X variable

11
Q

Example: if β₁ = 3, what happens?

A

Y increases by 3 units for every 1-unit increase in X (on average)

12
Q

What is the regression equation?

A

Y = β₀ + β₁X + ε

13
Q

What is the residual error (ε)?

A

Distance between an actual data point & the predicted value from the fitted line (used when checking assumptions)

14
Q

What does a steeper slope mean?

A

a stronger effect: a larger change in Y for each 1-unit change in X

15
Q

What is R2?

A

Proportion of variation in Y explained by X — a measure of model fit.

16
Q

How does correlation differ from regression?

A
  • Correlation measures strength & direction of a linear relationship between two variables, without distinguishing cause and effect.
  • Regression models & predicts relationship, specifying DV (Y) and one or more IV (X).
17
Q

What type of outcome is required for simple linear regression?

A

A normally distributed numeric outcome (continuous variable).

18
Q

What are residuals in regression analysis?

A

Differences between observed values & predicted values from regression line — they measure model’s error.

19
Q

What does a typical regression output include?

A
  • Regression coefficients
  • standard errors
  • confidence intervals
  • p-values
  • measures of model fit such as R².
20
Q

Example: BP = 105.24 + 0.8 × BMI. What is the predicted BP for BMI = 26?

A

105.24 + 0.8 × 26 = 126.04

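The prediction in card 20 is simple arithmetic; a minimal Python check, with the coefficients taken from the card's example equation:

```python
# Predicted BP from the example equation BP = 105.24 + 0.8 * BMI (card 20).
b0, b1 = 105.24, 0.8
bmi = 26
predicted_bp = round(b0 + b1 * bmi, 2)
print(predicted_bp)  # 126.04
```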
21
Q

What is R-squared (R²) in linear regression?

A
  • The coefficient of determination
  • proportion of variance in the DV explained by the model.
22
Q

What does an R² value close to 1 indicate?

A
  • Model explains almost all variability in the response variable — an excellent fit.
  • the closer R² is to 1.00, the better the linear regression model fits
23
Q

What does an R² value of 0 indicate?

A

That the model explains none of the variation in the dependent variable.

24
Q

What is the relationship between association & R2?

A

The stronger the association between outcome & exposure, the higher the R²

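Cards 21–24 describe R² in words; a pure-Python sketch (data invented for illustration) computes the slope, intercept, and R² directly from their definitions:

```python
# Fit y = b0 + b1*x by least squares on invented data, then compute R^2.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Slope: covariance of x and y divided by variance of x.
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx  # intercept: the fitted line passes through (mean x, mean y)

# R^2 = 1 - residual sum of squares / total sum of squares.
pred = [b0 + b1 * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ss_tot = sum((yi - my) ** 2 for yi in y)
r2 = 1 - ss_res / ss_tot
```

Because the invented data are almost perfectly linear, R² comes out close to 1, matching card 22's interpretation.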
25
What are the four main assumptions of linear regression?
1. Linear relationship between IV & DV 2. No significant outliers 3. Normality of residuals (and ideally of the DV) 4. Homoscedasticity of residuals
26
What happens if the relationship between 2 numeric variables is not linear?
Linear regression analysis will provide inaccurate regression coefficients & results
27
How can you test for linearity in regression? (assumption 1)
* using scatter plots * residual-vs-fitted value plots.
28
How are outliers detected in linear regression? (assumption 2)
Using scatterplots, residual plots, Cook’s Distance.
29
What is linear regression sensitive to?
* Outlier effects * may need to exclude them, run a sensitivity analysis, or apply transformations
30
How is the normality of residuals tested? (assumption 3)
* Q–Q plots * P–P plots * Shapiro–Wilk test
31
How is homoscedasticity tested?
* Using residuals-vs-fitted plots * or the Breusch–Pagan test (estat hettest). * scatterplots
32
What is homoscedasticity of residuals?
The residuals (distances between each point and the best-fit line) have equal spread across the regression line
33
What can you do if regression assumptions are violated?
* Recheck data and outliers * transform variables * use bootstrapping for robust inference.
34
What should you expect to see in a residuals-vs-fitted values plot if the linearity assumption holds?
Points should be randomly scattered around the horizontal zero line, without any pattern.
35
What is the purpose of the **acprplot** command in Stata?
It creates an augmented component-plus-residual plot to help identify non-linear relationships.
36
What indicates nonlinearity in an acprplot?
Systematic curves or clear deviations from the horizontal line.
37
What is Cook’s Distance used for in regression analysis?
to test the influence of outliers
38
What do we get from Stata when we use the cooksd option of predict?
table with the influence statistic & residual for each observation
39
Why are there four options for testing normality in regression?
Because Q–Q and P–P plots can each be applied to either the original variables (X or Y) or the residuals from the regression.
40
What does a Q–Q plot of the observed variable test?
Whether the original variable (e.g., X or Y) is normally distributed before fitting the model.
41
What does a Q–Q plot with residuals test?
Whether the observed residuals follow a normal distribution
42
What does a P-P plot of the observed variable test?
It compares the cumulative distribution function (CDF) of the X or Y variable against the CDF of the standard normal distribution
43
What does a P–P plot with residuals test?
It compares the cumulative distribution function (CDF) of the residuals against the CDF of the standard normal distribution
44
What are the four options for testing normality in regression?
1. Q–Q plot with observed data 2. Q–Q plot with residuals 3. P–P plot with observed data 4. P–P plot with residuals.
45
Which plot is better for detecting tail problems (skewness, kurtosis, heavy tail)?
The Q–Q plot.
46
Which plot is better for detecting central deviations from normality?
P–P plot.
47
When is Q-Q plot better to use?
for spotting tail problems (e.g. heavy tails, skewness)
48
When is P-P plot better to use?
better for seeing central deviations from normality
49
Difference between Q–Q and P–P plots?
* Q–Q detects tail problems * P–P focuses on central deviations.
50
Are small deviations from the diagonal acceptable in normal probability plots?
Yes, especially in large samples
51
Which part of the data must be normal?
The residuals, not necessarily the dependent variable itself.
52
What are the Stata commands for Q–Q and P–P plots?
* "qnorm variable" - QQ * "pnorm variable" - PP
53
Why can the Shapiro-Wilk test be misleading?
Large samples can show significant p-values for trivial deviations.
54
What would a Q–Q plot look like if residuals are right-skewed?
The points would curve below the diagonal line on the left and above the line on the right — forming an “S” shape.
55
How do you interpret the Shapiro-Wilk test?
p ≥ 0.05 → data are normal; p < 0.05 → non-normal.
56
Stata command for a formal normality test?
"**swilk variable**" — Shapiro-Wilk test
57
What common issues cause non-normal residuals?
* Outliers * skewed data * or non-linear relationships.
58
How can you fix non-normal residuals?
* Remove outliers * transform Y (e.g., log) * use bootstrapping.
59
How can homoscedasticity be assessed visually?
* By plotting residuals against fitted values * points should be more or less equidistant from a horizontal line
60
What Stata command tests for homoscedasticity statistically?
**estat hettest** (Breusch–Pagan / Cook–Weisberg test).
61
How do you interpret the Breusch–Pagan test result (hettest)?
* p ≥ 0.05 → Constant variance (assumption holds). * p < 0.05 → Heteroscedasticity (assumption violated).
62
What is homoscedasticity in linear regression?
* residuals have constant variance across all levels of the independent variable(s). * equal distance from line of best fit
63
What is heteroscedasticity?
* When **variance of residuals changes** across the fitted values * e.g., form a funnel or U-shaped pattern in a residual plot.
64
Why is homoscedasticity important?
Because unequal variance makes standard errors, t-tests, & confidence intervals unreliable, even if coefficients remain unbiased.
65
What should residuals look like if the assumption holds?
Randomly scattered around zero with a **roughly equal vertical spread** across all fitted values.
66
What pattern suggests heteroscedasticity?
funnel shape — residuals narrow at one end and widen at the other, or vice versa.
67
Which plot is most useful for checking homoscedasticity?
The residuals-vs-fitted-values plot.
68
What Stata command produces a residuals-vs-fitted plot?
* Stata: **rvfplot** * It displays residuals (Y-axis) vs. fitted values (X-axis).
69
What formal test checks for heteroscedasticity in Stata?
The Breusch–Pagan / Cook–Weisberg test, using **estat hettest** (its null hypothesis is homoscedasticity, so it tests *for* heteroscedasticity)
70
Can heteroscedasticity affect coefficient estimates?
No, coefficients stay unbiased, but inference (SEs, p-values, CIs) can be incorrect.
71
What causes heteroscedasticity?
* Outliers * skewed variables * incorrect functional form * subgroups with different variability.
72
How does a residual plot look when homoscedasticity holds?
Points form a horizontal cloud around zero, evenly spread along X.
73
How can heteroscedasticity be fixed?
1. Transform the dependent variable (e.g., log, sqrt) 2. Use robust standard errors 3. Remove or model outliers.
74
What happens if assumptions are violated for linear regression?
* not a reason to reject the whole analysis * Provided the sample size is large, results may still be robust to violations of the assumptions
75
What should you do first when regression assumptions are violated?
* **Check for data errors** & **remove influential outliers** * then repeat analysis to see if results change.
76
What can you do if the linearity or normality assumptions are violated?
Apply **data transformations** (e.g., log or square root) to improve linearity & normalize residuals.
77
What method can be used if assumptions remain violated after transformations?
Use **bootstrapping** to obtain assumption-independent confidence intervals and robust standard errors.
78
✅ What are the three main actions to take when regression assumptions are violated?
1️⃣**Check and clean data** : correct errors and remove influential outliers. 2️⃣ **Transform variables** : fix non-linearity or non-normality (e.g., log, square root). 3️⃣ **Use bootstrapping**: obtain robust, assumption-free confidence intervals.
79
What is data transformation?
transforming the data to a different scale of measurement
80
What can transformation address?
assumption violations (especially the linearity assumption, but also non-normality & heteroscedasticity)
81
When is a log transformation useful?
When data are positively skewed or variance increases with fitted values.
82
What is logarithmic transformation?
* most common * only used with positive values
83
What’s the formula for a log transformation?
* u = log₁₀(x) * u = ln(x)
84
How do you interpret log-transformed regression results?
Exponentiate coefficients: exp(β) gives multiplicative (percentage) change in Y for a one-unit change in X.
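Card 84's exp(β) rule in a quick Python sketch; the coefficient value here is invented for illustration:

```python
import math

# Suppose a model of log(Y) on X estimated beta = 0.05 (invented value).
beta = 0.05
multiplier = math.exp(beta)          # multiplicative change in Y per 1-unit increase in X
pct_change = (multiplier - 1) * 100  # approximate percentage change in Y
print(round(pct_change, 2))  # 5.13 -> about a 5.13% increase per unit of X
```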
85
What is bootstrapping?
Resampling method that takes repeated samples with replacement to estimate standard errors & confidence intervals.
86
Why use bootstrapping?
It provides reliable conclusions without relying on normality or homoscedasticity assumptions.
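A minimal sketch of the bootstrap idea from cards 85–86, with invented data: resample cases with replacement, refit the slope each time, and take percentiles of the resampled slopes as an assumption-free confidence interval:

```python
import random

random.seed(0)  # reproducible resampling
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 3.1, 5.2, 5.9, 8.4, 9.1, 11.0, 12.2]  # invented data

def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)

boot = []
for _ in range(2000):
    idx = [random.randrange(len(x)) for _ in range(len(x))]  # sample rows with replacement
    xs = [x[i] for i in idx]
    if len(set(xs)) > 1:  # skip degenerate resamples with no variation in x
        boot.append(slope(xs, [y[i] for i in idx]))

boot.sort()
ci = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])  # percentile 95% CI
```

The percentile interval comes straight from the sorted resampled slopes, so it needs no normality or homoscedasticity assumptions.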
87
Can categorical exposure variables be used in linear regression?
Yes, if coded numerically (e.g., 0 = reference, 1 = exposed). | for binary exposure variables (yes & no or exposed & unexposed)
88
How is a binary exposure variable interpreted?
β coefficient shows the mean difference in Y between the two groups.
89
What does the intercept represent in a binary model?
Mean value of Y for the reference (0) group.
90
How are multi-category variables handled in Stata?
Using dummy coding or **i.variable** notation.
91
How are categorical variables with more than two categories coded for regression?
Numerically coded — e.g., 0, 1, 2, 3 or 1, 2, 3, 4 — where the lowest value represents reference (unexposed) category.
92
What are the two main ways to include multi-category variables in a regression model?
1️⃣ As a numeric variable (dummy numeric entry). 2️⃣ As a categorical variable (with i. prefix in Stata).
93
What happens if a multi-category variable is entered as a numeric variable in Stata (just the variable name)?
Stata treats it as a continuous variable, & β coefficient represents mean change in Y for each one-category increase in X.
94
What happens if a multi-category variable is entered as a **categorical variable** in Stata (with i. before it)?
Each category is compared separately to the **baseline (reference) group** & each **β coefficient** represents mean difference in Y between that group & the baseline.
95
How do you interpret β coefficients when using i.variable?
Each β shows how much higher or lower the **mean Y** is for that category compared to the **reference (baseline)** category.
96
What is a dummy variable in regression?
* A **binary variable** coded as 0 or 1 that represents a specific category of a categorical variable * (e.g., 1 = exposed, 0 = not exposed).
97
Why do we use dummy variables?
To include categorical variables in regression models by converting categories into **numeric indicators** that can be interpreted statistically.
98
How is the β coefficient interpreted for a dummy variable?
It represents **mean difference** in DV (Y) between group coded as **1** and the **reference group coded as 0.**
99
Example: If smoking (1 = smoker, 0 = non-smoker) and β = 5, how is this interpreted?
On average, smokers have Y values 5 units higher than non-smokers.
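Cards 88–99 state that for a binary dummy, β₁ is the difference in group means and β₀ is the reference-group mean; a small pure-Python check with invented data confirms this:

```python
# Two groups coded 0 (reference) and 1 (exposed); invented outcome values.
group0 = [4.0, 5.0, 6.0]    # mean 5
group1 = [9.0, 10.0, 11.0]  # mean 10

x = [0] * len(group0) + [1] * len(group1)
y = group0 + group1
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Ordinary least squares slope and intercept.
b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
b0 = my - b1 * mx

# b1 equals mean(group1) - mean(group0); b0 equals mean(group0).
```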
100
When might you use a t-test or ANOVA instead of regression?
When comparing group means and regression assumptions are hard to meet.
101
What is the first step in simple linear regression?
**Data visualization** — examine the data before modelling.
102
How do you visualize the dependent variable before regression? | Step 1 (data visualisation) of simple linear regression
Use a **histogram** to check its distribution.
103
How do you visualize the relationship between variables? | Step 1 (data visualisation) of simple linear regression
Draw a scatter plot with DV on Y-axis & IV on X-axis.
104
What Stata command creates a histogram of a variable? | for step 1 (data visualisation) of simple linear regression
**hist var_name**
105
What Stata command creates a scatter plot between two variables? | for step 1 (data visualisation) of simple linear regression
* **scatter var1 var2** * var1 = dependent variable * var2 = independent variable.
106
What is the second step in simple linear regression?
* Fit the regression model & * interpret results **(regression coefficients, 95% CI, p-values, R²).**
107
What Stata command fits a simple linear regression?
* **regress var1 var2** * var1 = dependent variable * var2 = independent variable.
108
How can you generate residuals in Stata after regression?
* **predict residuals if e(sample), resid** * Creates new variable containing the model’s residuals.
109
What does “residuals” mean in regression output?
**differences between observed & predicted values** — they show how well the model fits the data.
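Residuals as defined in cards 108–110 are just observed minus predicted values; a one-line Python illustration with invented numbers:

```python
# Residual = observed - predicted (invented values).
observed  = [10.0, 12.0, 15.0]
predicted = [ 9.5, 12.5, 14.0]
residuals = [o - p for o, p in zip(observed, predicted)]
print(residuals)  # [0.5, -0.5, 1.0]
```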
110
What does it mean if residuals are large?
Model **doesn't describe relationship well** — predictions are far from the observed values.
111
What does a low R² value indicate?
The model explains little of Y’s variation — the relationship may be weak or other predictors are missing.
112
How do you interpret an R² value of 0.80?
80% of the variability in Y is explained by the independent variable X — the model fits the data well
113
What does a negative β coefficient mean?
As X increases, Y decreases — there is a negative relationship.
114
What does a positive β coefficient mean?
As X increases, Y increases — there is a positive relationship.
115
What does the p-value indicate in regression output?
It tests whether the β coefficient is significantly different from zero (i.e., whether X is related to Y).
116
How can you check for a linear relationship in Stata?
Use the **augmented component-plus-residual plot** **(acprplot)**.
117
What Stata command checks the linearity assumption?
* acprplot var1, lowess * var1 = IV
118
What does the lowess option in acprplot do?
It adds a smoothed curve to help visualize whether the relationship is approximately linear.
119
How do you interpret an acprplot?
* **Straight line** → linear relationship holds. * **Curved line** → relationship is non-linear (assumption violated).
120
What Stata command creates a residual-vs-fitted plot to check for outliers?
* **rvfplot, mlabel(id)** * mlabel() takes the name of your own ID variable (here called id) * It plots residuals against fitted values and labels each point with that variable's value
121
What does the mlabel(id) option do in rvfplot?
It labels data points with their ID numbers so potential outliers can be identified visually; replace id with the name of your own ID variable.
122
How can you view the characteristics of a possible outlier in Stata?
* **list var1 var2 if id==10** * Lists the selected patient’s values, where var1 var2 are patient characteristics (age, weight, height, etc.) & 10 is that patient’s ID number
123
What Stata command calculates Cook’s Distance values?
* **predict cook if e(sample), cooksd** * Creates a new variable (cook) storing Cook’s Distance for each observation.
124
What does Cook’s Distance measure?
Influence of each observation on regression model - how much the fitted values would change if that case were removed.
125
What command lists cases with large Cook’s Distance values?
* **list id if cook > 4/230** * This lists IDs with Cook’s Distance > 4/n, where n is the sample size.
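The 4/n rule from card 125 in a short Python sketch; the Cook's Distance values and IDs below are invented:

```python
# Flag observations whose Cook's Distance exceeds the 4/n rule of thumb.
n = 230  # sample size from the card's example
cutoff = 4 / n  # roughly 0.0174

cooks = {1: 0.002, 10: 0.031, 57: 0.019, 103: 0.005}  # invented id -> Cook's D
flagged = [pid for pid, d in cooks.items() if d > cutoff]
print(flagged)  # [10, 57]
```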
126
What does it mean if Cook’s Distance is high?
Observation has strong influence on model & should be checked for data entry errors or special causes.
127
What should you do if you find influential outliers?
* Verify accuracy * justify inclusion/exclusion * repeat analysis to see if results change.
128
Why is normality of residuals important?
Ensure p-values, confidence intervals, & hypothesis tests from model are valid.
129
What Stata command checks normality using a Q–Q plot?
* **qnorm residuals** * Plots residuals against a normal distribution.
130
What does it mean if points in a Q–Q plot fall roughly along a straight line?
The residuals are **approximately normal** — the assumption holds.
131
What does it mean if points curve away from the diagonal in a Q–Q plot?
The residuals are **not normally distributed** — the assumption is violated.
132
What statistical test can be used to formally test residual normality in Stata?
* **swilk residuals** * This runs Shapiro–Wilk test for normality.
133
How do you interpret the Shapiro–Wilk test result?
* **p ≥ 0.05** → residuals are **normal** (assumption met) * **p < 0.05**→ residuals are **not normal** (assumption violated).
134
135
What Stata command checks for homoscedasticity visually?
* **rvfplot, yline(0)** * plots residuals vs fitted values & adds a horizontal reference line at 0 for easier interpretation.
136
What should a residuals-vs-fitted plot look like if the assumption is met?
Points evenly scattered around zero line with no pattern or change in spread.
137
What Stata command formally tests for heteroscedasticity?
**estat hettest**
138
How do you interpret the **estat hettest** result?
* **p ≥ 0.05** → No heteroscedasticity (good — **assumption met**) * **p < 0.05** → Heteroscedasticity present (**assumption violated**).
139
What happens if the assumptions are met for simple linear regression?
results of linear regression are **valid**
140
What happens if the assumptions are NOT met for simple linear regression?
* results of linear regression are **NOT valid** * consider transformations or removing outliers, then go back to step 1