Linear Regression Flashcards

(62 cards)

1
Q

What are the purposes of regression analysis?

A
  • To estimate means for different levels of predictor variable X
  • To generate predictions for new cases, based on the predictor variable X
  • To test hypotheses about the association between X and Y
2
Q

What research question are we addressing with a two-sample t-test?

A
  • Whether two means (e.g. mean BMI) are statistically significantly different in the two groups (e.g. individuals with and without diabetes)
3
Q

What are the assumptions of independent samples t-tests?

A
  • Normality: If each sample is sufficiently large (over 30), then its sample mean is approximately normal by the CLT, or examine a histogram for each sample
  • Homogeneity of variance: Apply Levene’s test for equality of variances
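The homogeneity-of-variance check above can be sketched in code. This is a hedged illustration, not the course's own example: the group data are made up, and Levene's test is run via `scipy.stats.levene`.

```python
# Hedged sketch: checking the equal-variance assumption of a two-sample
# t-test with Levene's test (scipy.stats.levene).
# H0: the two groups have equal variances. The data below are made up.
import random

from scipy import stats

random.seed(2)
group_a = [random.gauss(25, 4) for _ in range(40)]  # e.g. BMI, no diabetes
group_b = [random.gauss(29, 4) for _ in range(40)]  # e.g. BMI, diabetes

stat, p = stats.levene(group_a, group_b)
print(p)  # a large p-value gives no evidence that the variances differ
```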
4
Q

What is the population regression model and what are its components?

A
  • Y = β0 + β1x + ϵ
  • β0 + β1x is the systematic (explained) variation
  • ϵ is the random (unexplained) variation
  • Y is a continuous variable for a simple or multiple linear regression model
  • X is either a continuous or a categorical variable
5
Q

What is ϵ?

A
  • A random variable
  • Often called the ‘error term’ or residual
6
Q

What is the mean and SD of ϵ?

A

Mean 0 and SD σ

7
Q

How can we write the estimated regression line and what are its components?

A
  • ŷ = b0 + b1x
  • b0 and b1 are estimated values of β0 and β1, respectively
  • (Image: the same equation written using hat notation)
8
Q

What is b0 and b1 relative to β0 and β1?

A

b0 and b1 are estimated values of β0 and β1

9
Q

A model is looking at BMI based on diabetes status. Write the regression model

A
  • Y = β0 + β1XDiab + ϵ
  • BMI = β0 + β1XDiab + ϵ
10
Q

In a model looking at BMI based on diabetes status, the estimates for β0 and β1 are 26.6 and 3.2, respectively. What is the interpretation?

A
  • b0 = 26.6, the average BMI for the group without diabetes
  • b1 = 3.2, the difference in average BMI between the groups (with diabetes minus without)
11
Q

In a model looking at BMI based on diabetes status, the estimates for β0 and β1 are 26.6 and 3.2, respectively. What is the estimated regression line?

A
  • ŷ = b0 + b1XDiab
  • ŷ = 26.6 + 3.2XDiab
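To make the arithmetic concrete, here is a minimal Python sketch applying the estimated line ŷ = 26.6 + 3.2·XDiab from the example above (the helper function name is made up):

```python
# Hypothetical helper applying the estimated regression line from the
# BMI/diabetes example: y-hat = 26.6 + 3.2 * x_diab, with x_diab 0 or 1.

def predict_bmi(x_diab):
    """Fitted mean BMI for diabetes status x_diab (0 = no, 1 = yes)."""
    b0, b1 = 26.6, 3.2  # estimated intercept and slope from the example
    return b0 + b1 * x_diab

print(predict_bmi(0))           # 26.6, the average BMI without diabetes
print(round(predict_bmi(1), 1)) # 29.8, the average BMI with diabetes
```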
12
Q

How can we measure the ‘distance’ between the observed data and the fitted model?

A

In terms of residuals

13
Q

What is the residual in terms of the equation Ŷi = b0 + b1Xi?

A
  • For each data point, the residual is the difference between the observed value Yi and the fitted value Ŷi
  • The residual is the vertical distance between a point on the scatterplot and the fitted value on the regression line
14
Q

What is the role of the regression line?

A
  • It estimates the average values for the dependent variable corresponding to each value of the independent variable
  • For a scatter diagram, the regression line serves the role that an average does for a single variable
15
Q

What is the method of least squares?

A
  • A method that measures the ‘distance’ between the model and the observed data in terms of the sum of squared residuals
16
Q

What is the ‘best’ or ‘least squares’ regression line?

A
  • The line that minimizes the sum of squared residuals for all the points in the plot
  • The regression parameter estimates b0 and b1 that yield the smallest possible sum of squared residuals
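The least squares solution for simple linear regression has a closed form: b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. A minimal Python sketch, with made-up data chosen so the answer is exact:

```python
# Closed-form least squares estimators for simple linear regression.
# The data below lie exactly on the line y = 1 + 2x, so the fit is perfect.

def least_squares(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # slope: Sxy / Sxx
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
         / sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```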
18
Q

What is the total sum of squares (TSS)?

A
  • The total variability of the outcome about its mean
  • It has two components
19
Q

What are the two components of TSS?

A
  • MSS
  • RSS
20
Q

What is MSS?

A
  • Model sum of squares
  • The variability of the outcome about its mean that is accounted for by the model
21
Q

What is RSS?

A
  • Residual sum of squares
  • The variability of the outcome about its mean that cannot be accounted for by the model
  • Also known as Sum of Squares due to Error (SSE)
22
Q

What is the relationship between TSS, MSS, and RSS?

A

TSS = MSS + RSS

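The decomposition TSS = MSS + RSS can be verified numerically. A self-contained Python sketch on a small made-up dataset:

```python
# Numerically checking TSS = MSS + RSS on a small made-up dataset.

x = [0, 1, 2, 3]
y = [1, 2, 5, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# least squares fit
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)               # total variability
mss = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by model
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained

print(round(tss, 4), round(mss, 4), round(rss, 4))  # 30.0 28.8 1.2
```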
23
Q

How can RSS be minimized?

A
  • It is, by definition, minimized by the regression line
24
Q

What is the least squares (ls) solution?

A
  • The choice of b0 and b1 that minimizes the RSS
  • It gives the smallest possible RSS given the sample data
  • So, b0 and b1 are called the least squares estimators
25
Q

What level of SSE is an indicator of good and poor fit?

A
  • If SSE = 0, then the line fits the data perfectly
  • A "large" SSE is an indicator of poor fit
26
Q

What is the df (degrees of freedom) of the t-statistic, and what do the letters mean?

A
  • df = n - p - 1, where n is the number of cases and p is the number of predictors in the model
27
Q

What summary statistics can be used for regression models?

A
  • Model coefficients (slope, intercept)
  • Residual standard deviation
  • Multiple correlation coefficient R2 and adjusted R2
  • F-statistic
  • Model degrees of freedom
28
Q

What is the F-statistic used for?

A
  • It asks: overall, how well does the model explain or predict Y?
  • It’s used as an overall test of the model, assessing whether the predictors, considered as a group, are associated with the response
29
Q

What are the null and alternative hypotheses for the F-statistic?

A
  • H0: β1 = β2 = … = βp = 0
  • HA: at least one of the slope coefficients is not 0
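The F-statistic itself is F = (MSS/p) / (RSS/(n − p − 1)). A hedged Python sketch on a small made-up simple-regression dataset (p = 1):

```python
# Sketch of the overall F-test statistic on made-up data with p = 1 predictor:
# F = (MSS / p) / (RSS / (n - p - 1)).

x = [0, 1, 2, 3]
y = [1, 2, 5, 8]
n, p = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

mss = sum((yh - y_bar) ** 2 for yh in y_hat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
f_stat = (mss / p) / (rss / (n - p - 1))
print(round(f_stat, 2))  # 48.0: a large F is evidence against H0
```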
30
Q

What is the residual variance?

A
  • An estimate of σ2: a measure of the unexplained variation in the y variable, calculated as the sum of squared residuals divided by the residual degrees of freedom (n - p - 1)
  • In R output, σ hat is reported as the residual standard error (RSE)
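The deck's output is from R, but the computation RSE = √(RSS / (n − p − 1)) is easy to sketch in Python on made-up data:

```python
# Sketch of the residual standard error (sigma-hat) on made-up data:
# RSE = sqrt(RSS / (n - p - 1)).
import math

x = [0, 1, 2, 3]
y = [1, 2, 5, 8]
n, p = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
rse = math.sqrt(rss / (n - p - 1))
print(round(rse, 4))  # 0.7746
```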
31
Q

How is σ hat defined in R output?

A

As the residual standard error (RSE)
32
Q

What is R^2?

A
  • The squared multiple correlation coefficient (coefficient of determination). It has many interpretations:
  • The square of the correlation between observed and predicted values
  • The proportion of variation explained by the fitted model
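Both interpretations can be checked numerically; on any fitted line they agree. A Python sketch on made-up data:

```python
# Two equivalent views of R2 on made-up data: 1 - RSS/TSS, and the
# squared correlation between observed and fitted values.
import math

x = [0, 1, 2, 3]
y = [1, 2, 5, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

tss = sum((yi - y_bar) ** 2 for yi in y)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
r2_prop = 1 - rss / tss  # proportion of variation explained

num = sum((yi - y_bar) * (yh - y_bar) for yi, yh in zip(y, y_hat))
den = math.sqrt(tss * sum((yh - y_bar) ** 2 for yh in y_hat))
r2_corr = (num / den) ** 2  # squared correlation of observed vs. fitted

print(round(r2_prop, 4), round(r2_corr, 4))  # both 0.96
```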
33
Q

What type of regression is R^2 used for?

A

Simple linear regression; for multiple linear regression, adjusted R2 is preferred
34
Q

What level of R^2 indicates good and poor fit?

A
  • If a linear model perfectly captured the variability in the observed data, then R2 would be 1
  • An R2 near 0 indicates that the model explains little of the variation (poor fit)
35
Q

What is adjusted R^2?

A
  • A measure that incorporates a penalty for including predictors that do not contribute much towards explaining the observed variation in the response variable
  • It is often used to balance predictive ability with model complexity
  • Unlike R2, R2adj does not have an inherent interpretation
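The standard penalized formula is R²adj = 1 − (1 − R²)(n − 1)/(n − p − 1). A small sketch (the numbers are made up) showing how the penalty grows with the number of predictors p:

```python
# Adjusted R2: a degrees-of-freedom penalty on R2.
# R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R2 = 0.9; the penalty grows as p (predictors) increases.
print(round(adjusted_r2(0.9, n=20, p=1), 3))   # 0.894
print(round(adjusted_r2(0.9, n=20, p=10), 3))  # 0.789
```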
36
Q

In general, what level of R2 and σ hat do you want to say that the model is ‘good’ or that ‘X explains the variation in Y well’?

A
  • A high R2 and a low σ hat
37
Q

What is the L assumption?

A
  • Linearity: the relationship between the response variable and the predictor is linear
  • This can also be checked by calculating the mean of the errors, which ideally should be 0
38
Q

What is the I assumption?

A
  • Independence: observations are independent of each other, and thus the errors are independent
39
Q

What is the N assumption?

A
  • Normality: the errors appear normally distributed
  • This is the least important assumption, due to the CLT
40
Q

What is the E assumption?

A
  • Equal variances (homoscedasticity): constant variance of the errors/residuals across all levels of the predictor variables
41
Q

How do you assess the linearity assumption?

A
  • Using scatter plots, boxplots, and q-q plots
  • (Image: how to interpret linearity from scatter plots)
42
Q

How do you assess the independence assumption?

A
  • Using statistical tests
  • Inferring from the information provided
43
Q

How do you assess the normality assumption?

A
  • Using a q-q plot
  • Using histograms
  • Using statistical tests
44
Q

How can you use a q-q plot to evaluate the normality assumption?

A
  • Look at whether the plot is linear
45
Q

When using statistical tests to assess the normality assumption, what are the null and alternative hypotheses?

A
  • H0: the errors/residuals are normally distributed (no difference between the data and the normal curve)
  • HA: there is a difference, i.e. the errors are not normally distributed
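One such test is Shapiro-Wilk. A hedged Python sketch using `scipy.stats.shapiro` on simulated (truly normal, made-up) residuals:

```python
# Hedged sketch: Shapiro-Wilk normality test on residuals via
# scipy.stats.shapiro. H0: residuals are normally distributed;
# a small p-value (e.g. < 0.05) is evidence against normality.
import random

from scipy import stats

random.seed(1)
residuals = [random.gauss(0, 1) for _ in range(100)]  # simulated normal errors
stat, p = stats.shapiro(residuals)
print(p)  # for truly normal residuals, p is usually well above 0.05
```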
46
Q

How do you assess the equal variances/homoscedasticity assumption?

A
  • Using plots of residuals vs. predictor
  • Or plots of residuals vs. fitted values
47
Q

What are O and M?

A
  • nO influential outliers
  • No strong Multicollinearity
48
Q

What is an outlier?

A
  • A point with a very large residual
49
Q

Graphs showing outlier vs. leverage vs. influence

A
  • Outlier: large residual (the point is far from the line in the Y direction)
  • Leverage: outside the typical range of X values
  • Influence: influential on the model estimates
50
Q

What is the effect of high leverage points?

A
  • We want good coverage for a range of x values to avoid extrapolating too much
  • High leverage points have the potential to exert influence on the estimated coefficients
51
Q

What is influence?

A
  • The actual influence a point has on the estimated coefficients
  • A point may have high leverage but low influence (or vice versa)
  • You can determine this by removing a point from the data and looking at how the model changes (DFBETA)
52
Q

What is the Box-Cox transformation and what is its purpose?

A
  • A popular way to automatically find a good transformation of the outcome variable y
  • Objective: to find the best exponent λ for transforming y into y^λ
  • It is designed for a strictly positive outcome variable y and chooses the transformation that gives the best fit to the data
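A hedged sketch of Box-Cox in Python, using `scipy.stats.boxcox`, which estimates λ by maximum likelihood for a strictly positive y. The skewed outcome data are made up for illustration:

```python
# Hedged sketch of a Box-Cox transformation via scipy.stats.boxcox.
import math
import random

from scipy import stats

random.seed(0)
# Right-skewed, strictly positive outcome (log-normal), made up for illustration.
y = [math.exp(random.gauss(0, 1)) for _ in range(200)]
y_transformed, lam = stats.boxcox(y)
print(lam)  # an estimated lambda near 0 suggests a log transformation of y
```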
53
Q

Example of Box-Cox transformation
54
Q

What are reasons to do a log transformation?

A
  • Transformation of the predictors to achieve linearity (L)
  • Transformation of the outcome to normalize residuals (N)
  • Transformation of the outcome to stabilize variance (E)
55
Q

Log transformation of predictor
56
Q

Log transformation of the outcome
57
Q

Log transformation of both the predictor and outcome
58
Q

When transforming the predictor with polynomials, how do you choose the polynomial degree (d)?

A
  • Forward: add one higher-order term at a time until the added term is not statistically significant
  • Backward: start with a large d and eliminate the non-significant terms, beginning with the highest-order term
59
Q

What are the two major pitfalls in MLR?

A
  • Collinearity
  • Overfitting
60
Q

What is collinearity, and how does it arise during MLR?

A
  • Collinearity arises when two or more predictors that measure similar things (e.g. BMI and weight) are included in an MLR
  • Essentially, one predictor can be (nearly) a linear combination of the other predictors
  • The estimated regression coefficients become unstable and change dramatically
  • The standard errors of the regression coefficients 'blow up'
61
Q

How can you detect collinearity?

A
  • Assess the correlation matrix of the predictors
  • Regress each predictor (e.g. X1) on all the other predictors and check the R2; a value close to 1 is a problem, because it means that predictor is nearly a linear combination of the others
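The check above can be sketched for the two-predictor case, where the R² of one predictor regressed on the other is just their squared correlation. The variables and helper function below are made up; `x2` is deliberately constructed to be almost a linear function of `x1`:

```python
# Sketch: detecting collinearity by checking the R2 of one predictor
# regressed on another. x2 below is nearly a linear function of x1.
import random

random.seed(42)
x1 = [random.uniform(0, 10) for _ in range(50)]
x2 = [2 * v + random.gauss(0, 0.01) for v in x1]  # nearly collinear with x1

def r_squared(x, y):
    """Squared correlation = R2 of a simple regression of y on x."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

print(r_squared(x1, x2) > 0.99)  # True: close to 1, so collinearity is a problem
```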
62
Q

What is overfitting, and what are its effects?

A
  • In multivariable modeling, you can get highly significant but meaningless results if you keep adding predictors
  • The model fits the quirks of your particular sample perfectly, but has no predictive ability in a new sample
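This effect can be demonstrated with a simulation. The sketch below (made-up data, NumPy least squares) fits pure-noise predictors to a random outcome: in-sample R² can only rise as predictors are added, and with n − 1 junk predictors plus an intercept the fit is "perfect" despite having no predictive value:

```python
# Sketch of overfitting: in-sample R2 climbs as pure-noise predictors
# are added, reaching ~1 when the number of parameters equals n.
import numpy as np

rng = np.random.default_rng(0)
n = 12
y = rng.normal(size=n)  # outcome is pure noise

def r2_with_p_noise_predictors(p):
    """R2 of an OLS fit of y on an intercept plus p random-noise predictors."""
    X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(p)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print([round(r2_with_p_noise_predictors(p), 2) for p in (1, 5, 11)])
# R2 grows toward 1; with p = n - 1 noise predictors the fit is 'perfect'
```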