Lecture 18: Regression: Flashcards

(57 cards)

1
Q

What type of variables are seen in regression?

A

Regression: Continuous dependent and independent variables

2
Q

What’s the difference between QUALITATIVELY and QUANTITATIVELY?

A

Qualitative information describes qualities, characteristics, or experiences using words.

Quantitative information measures, counts, or quantifies using numbers and statistics.

3
Q

What does a simple linear regression describe?

A

Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis (distance from East Africa), and a response variable, plotted on the y-axis (genetic diversity).

4
Q

What variable regresses on the other?

A

We say “regress Y on X”

Ex: “regress genetic diversity on distance from Africa”

5
Q

What is the observed value?

A

The dots on the scatter plot

6
Q

What is the line that follows the general trend of the scatter plot called?

A

The fitted regression line, representing predicted values for any given value of X (proportion black). It is the best fit of the data → aids in making predictions from the data.

7
Q

What does the fitted regression line aid in?

A

It aids in making predictions from the data.

8
Q

What is the predicted value?

A

Values along the fitted regression line

9
Q

What is the Residual value e?

A
  • The difference (deviation) between the observed and predicted values.
  • Can have positive and negative residuals!
10
Q

What algorithm does a regression model use and what does it assure?

A

A regression model uses an algorithm called “ordinary least squares (OLS)” that ensures that the residual (deviation) values are as small as possible given the data. In other words, OLS makes the predicted values as close as possible (on average) to the observed values.
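The OLS idea can be sketched in Python. The data below are invented for illustration, and the closed-form estimates (slope = Sxy/Sxx, intercept = ȳ − slope·x̄) are the standard least-squares formulas:

```python
# Sketch of ordinary least squares (OLS); the data are invented for
# illustration. Closed-form estimates: b = Sxy / Sxx, a = ybar - b * xbar.

def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx        # slope
    a = ybar - b * xbar  # intercept
    return a, b

def sum_sq_residuals(x, y, a, b):
    # Sum of squared deviations between observed y and the line a + b*x.
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

a, b = ols_fit(x, y)
best = sum_sq_residuals(x, y, a, b)

# Any nearby line does worse: OLS makes the residuals as small as possible.
assert best < sum_sq_residuals(x, y, a, b + 0.1)
assert best < sum_sq_residuals(x, y, a - 0.1, b)
```

Perturbing either coefficient always increases the sum of squared residuals, which is exactly what “least squares” means.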

11
Q

The regression line through a scatter of points is described by the following equation:

A

Y = a + bX

1) 𝒀 is referred to as the response variable (also dependent variable).
2) 𝑿 is referred to as the explanatory variable.
3) a = intercept: the predicted value of Y when X is zero (unit is the same as in Y).
4) b = slope: the rate of change in Y as X changes.

12
Q

Why must you be careful when interpreting the intercept of a regression line?

A

A meaningful interpretation is only possible if X can truly be zero AND if the data include values close to zero (not the case here) → this is an issue.

13
Q

What is the unit of the intercept of the regression line?

A

The unit attached to the intercept is the same as the response variable (i.e., years).

14
Q

Why is the intercept useful for prediction?

A

Because it represents the addition (offset) required to correctly position the regression line so that predictions match the observed data.

–> Different intercepts would lead to predicted values that are either too high or too low.

15
Q

Define the slope

A

Because X is expressed as a proportion (i.e., 0 to 1), the slope is the increase in the response variable (age) when the predictor increases by 100%, i.e., when X goes from 0 to 1.

16
Q

What does Ŷ (y hat) stand for?

A

Predicted values on the regression line

17
Q

What are residuals 𝜀?

A

Residual values 𝜀 are the difference (deviation) between the observed and predicted values

18
Q

True or false: Each observation in the data has a predicted & residual value

A

TRUE

19
Q

What is the purpose of the OLS or ordinary least squares?

A

Trying to minimize the sum of the squares of the residuals

20
Q

What is the aim of a regression line?

A

A regression model aims at predicting the average Y based on X, i.e., predicting the average age of male lions based on their proportion of black spots.

21
Q

What does the line of best fit for a regression line minimize?

A

The line of best fit minimizes the average distance between data and fitted line, i.e., the residuals.

22
Q

How do we find the best line of a regression?

A

To find the best line, we must minimise the sum of the squares of the residuals

23
Q

Why does the line-of-best-fit method minimise the sum of the squares of the residuals, using squares rather than square roots?

A

Squares are used so that the quantity becomes a variance equation and the F-distribution can be used.

24
Q

What is the H0 and HA of a t-test in statistical hypothesis testing of a regression model?

A

1) H0: the statistical population slope 𝛽 = 0 (i.e., Y can’t be predicted by X).
2) HA: the population slope 𝛽 ≠ 0 (i.e., Y can be predicted by X).

25
Q

When is regression significant?

A

Regression is significant when the slope is different from zero.

26
Q

As with any estimate based on sample data, slopes can differ from zero even when the true population slope is zero, simply due to _____

A

sampling variation.

27
Q

How is regression tested using something similar to a one-sample t-test?

A

The regression slope b divided by its standard error can be used to test the null hypothesis that 𝛽 = 0. This is similar to the one-sample t-test:

--> Dividing the estimate by its standard error gives a statistic that is t-distributed.
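This slope t-test can be sketched in Python. The data are hypothetical, and the constant 2.447 is assumed to be the two-sided 5% critical value of the t-distribution with n − 2 = 6 degrees of freedom:

```python
import math

# Hypothetical data (not from the lecture): does X predict Y?
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# Residual standard error (df = n - 2), then the standard error of the slope.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = b / se_b          # analogous to a one-sample t-test statistic
t_crit = 2.447        # assumed two-sided 5% critical value, df = 6
significant = abs(t) > t_crit  # if True, reject H0: beta = 0
```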
28
Q

If P < 0.05 for a regression, what does this mean?

A

Reject H0 and conclude that the regression model can predict Y values.

29
Q

What occurs when there's an increase in residuals?

A

Larger residuals mean more variation, leading to a greater standard error and a smaller t-value, so there is less statistical power to reject the null hypothesis.

30
Q

What does statistical testing of regression depend on?

A

Both the slope and the residual variation.
31
Q

What is measured using the quantity called the “coefficient of determination” or the “famous” R^2?

A

The fraction of variation in Y (age) that is “explained” by X in the estimated linear regression model.

32
Q

What is the formula for R^2?

A

R^2 = regression sum-of-squares / total sum-of-squares

33
Q

What is the total sum-of-squares in R^2?

A

The maximum amount of variation in Y that could be explained by any linear regression model, i.e., the total sum-of-squares of Y.

34
Q

What is the regression sum-of-squares in R^2 and what is it a measure of?

A

The amount of variation in Y (age) that the regression model with X (proportion of black spots) as a predictor explains: the deviation of the predicted values from the mean of the observed values → a measure of quality.

35
Q

What does it mean when R^2 = 0.6238, for example?

A

We state that the regression model explains 62.38% of the total variation in Y. Ex: 62% of the variation in the age of lions can be predicted using the proportion of black spots on the noses of male lions.
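The R^2 decomposition can be sketched in Python with invented data (not the lion example):

```python
# Sketch of the coefficient of determination R^2 (invented data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.3, 9.7]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]  # predicted values on the fitted line

ss_total = sum((yi - ybar) ** 2 for yi in y)    # total variation in Y
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)   # variation the model explains
r_squared = ss_reg / ss_total

# An r_squared of, say, 0.62 would mean the model explains 62% of the
# total variation in Y.
```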
36
Q

Does regression of Y on X always imply dependency?

A

NO: SPURIOUS CORRELATION
--> correlation between 2 variables having no causal relation.

37
Q

Define SPURIOUS CORRELATION

A

Correlation between 2 variables having no causal relation.
38
Q

What are confidence bands for regression lines?

A

Confidence bands describe the uncertainty around the mean relationship between X and Y.

39
Q

What does it mean when we are 95% confident that the slope will fit within this interval for a regression line?

A

Because individual data points include residual variation, many observed values will naturally fall outside the confidence bands. We are interested in the confidence of the slope, not of individual observations.

40
Q

What is the prediction interval for a regression line?

A

Prediction bands describe the uncertainty around individual observations, so they are always wider because they include both uncertainty in the mean and residual variation.

41
Q

Confidence versus prediction intervals:

A

1) Confidence bands describe the uncertainty around the mean relationship between X and Y.
2) Prediction bands describe the uncertainty around individual observations, so they are always wider because they include both uncertainty in the mean and residual variation.
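The two interval types can be sketched in Python using the standard simple-regression formulas. The data are invented, and 2.571 is assumed to be the two-sided 95% t critical value with n − 2 = 5 degrees of freedom:

```python
import math

# Sketch: confidence vs prediction interval half-widths at one X value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.8, 4.2, 5.9, 8.4, 9.6, 12.1, 13.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
# Residual standard error with n - 2 degrees of freedom.
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 4.0
leverage = 1 / n + (x0 - xbar) ** 2 / sxx
t_crit = 2.571                                    # assumed t_{0.975}, df = 5
half_conf = t_crit * s * math.sqrt(leverage)      # uncertainty in the mean
half_pred = t_crit * s * math.sqrt(1 + leverage)  # mean + residual variation

# Prediction bands include residual variation, so they are always wider.
assert half_pred > half_conf
```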
42
Q

What are the issues involving extrapolation in regression lines (predicting Y for X-values beyond the range of the data)?

A

- Ear length = 55.9 + 0.22(age). Our ears grow longer, about 0.22 mm per year.
- The intercept here predicts ear length at birth (X = 0 years); a baby does not have ears of 56 mm (i.e., 5.6 cm)!
- The slope changes over life, so at birth the prediction does not hold.
--> Predictions hold well within the range of X values but not outside.
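The ear-length example can be sketched directly; the equation is the one given in the card, and the point is that evaluating it at X = 0 extrapolates outside the sampled data:

```python
# The lecture's ear-length equation: ear length (mm) = 55.9 + 0.22 * age (years).
def ear_length_mm(age_years):
    return 55.9 + 0.22 * age_years

# Within the sampled (adult) age range the prediction is plausible:
adult = ear_length_mm(40.0)
# Extrapolating to X = 0 predicts 55.9 mm (5.6 cm) ears at birth — absurd:
newborn = ear_length_mm(0.0)
```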
43
Q

Define extrapolation

A

Extrapolation: the action of estimating or concluding something by assuming that existing trends will continue or a current method will remain applicable.
44
Q

What must you ensure about the predictor values of a regression line?

A

Ensure that the distribution of predictor values is approximately uniform within the sampled range; the standard error cannot tell you that.
- A non-uniform distribution of the predictor variable indicates biased sampling, which can lead to a Type I error.
45
Q

What are the 6 assumptions of regression models?

A

1) Linearity; it is critical to graph the data
2) All observations have similar influences on the regression model
3) Residual variation is normally distributed; that is where the confidence interval comes from
4) Residual variation is homoscedastic (constant across the range of X values)
5) Values of X (predictor) are measured without error (hard to assess, often assumed)
6) Residuals are independent: this is the assumption that data are sampled randomly
46
Q

What is the linearity assumption of the regression model?

A

- Plotting the residuals against predictor values is critical in assessing whether a linear model is appropriate.
- The horizontal line is the average of the residuals (which is always zero as a result of the fitting method).
- If variance differs in different parts of the line, this indicates lack of linearity or heteroscedasticity.
--> Dots should be mostly clustered around the line.
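A small Python sketch (invented data) of why the horizontal reference line in a residual plot sits at zero:

```python
# Residuals from an OLS fit always average to zero (invented data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.5, 4.9, 6.3]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n
# mean_residual is zero up to floating-point error; this is the horizontal
# line a residual-vs-predictor plot is drawn around.
```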
47
Q

What is the assumption that all observations must have similar influences on the regression model?

A

Francis Anscombe's quartet: Quartet 1 is the only appropriate one in the sense that all observations have the same influence on the model, i.e., removal of one observation won't affect the model much. There are different methods to estimate the influence of each observation on the model (advanced level).

48
Q

What is Francis Anscombe's quartet?

A

It comprises four data sets that have nearly identical simple descriptive statistics and regression models. Yet they have very different distributions and appear very different when graphed. These data demonstrate both the importance of graphing data before analyzing it and the impact of influential observations (outliers).
--> All quartets have the same regression model and R^2.
49
Q

Which quartet is appropriate for a regression model?

A

Quartet 1: data are scattered but follow a consistent trend.

50
Q

What is the assumption that residual variation is normally distributed (that is where the confidence interval comes from) for the regression model?

A

- Normality assumption: at each value of X, there is a normally distributed population of Y-values with its mean on the true regression line.
- One can estimate the model even if residuals are not normally distributed, but one cannot generalize the model to predict other observations in the statistical population or make inferences (e.g., p-values, confidence intervals, t-tests, ANOVAs).
51
Q

What is the assumption that residual variation is homoscedastic (constant across the range of X values) in regression models?

A

- Homoscedasticity assumption: at each value of X, there is a normally distributed population of Y-values with its mean on the true regression line. The variance of the Y-values is assumed to be the same for every value of X.
- One can't generalize the model to predict other observations if this assumption isn't met.

52
Q

What is the assumption that values of X (predictor) are measured without error (hard to assess, often assumed) for a regression model?

A

When values of X are measured with error, the slope changes and no longer properly predicts the values of Y.
--> ERROR IN X REDUCES SLOPES.

53
Q

What is an approach to the problem of measuring X values with error in regression models?

A

One approach to this problem is the so-called Type II regression model:
1) Vertical residuals: Type I regression (error in Y but not in X)
2) Perpendicular residuals: Type II regression (error in both Y and X)
54
Q

When performing a boxplot of both Type I and Type II regression, what can be seen?

A

1) Type I: values more clustered around the mean, with smaller whiskers indicating low variability in the data
2) Type II: wider boxplot, so values are less clustered around the mean, with longer whiskers indicating more variability in the data

55
Q

Type II regression is not biased but has a greater standard error (sampling variation): why?

A

Because both X and Y have errors.

56
Q

What must one be careful of when residuals are non-independent in regression models?

A

When residuals are non-independent, one should be careful about making inferences (e.g., p-values, confidence intervals, t-tests, ANOVAs).

57
Q

What are 2 tests that can be used to test whether the regression slope differs from zero?

A

1) Using a t-test
2) Using ANOVA (same H0 and HA)