Lecture 18: Regression: Flashcards

(57 cards)

1
Q

What type of variables are seen in regression?

A

Regression: Continuous dependent and independent variables

2
Q

What’s the difference between QUALITATIVELY and QUANTITATIVELY?

A

Qualitative information describes qualities, characteristics, or experiences using words.

Quantitative information measures, counts, or quantifies using numbers and statistics.

3
Q

What does a simple linear regression describe?

A

Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis (distance from East Africa), and a response variable, plotted on the y-axis (genetic diversity).

4
Q

What variable regresses on the other?

A

We say “regress Y on X”

Ex: “regress genetic diversity on distance from Africa”

5
Q

What is the observed value?

A

The dots on the scatter plot

6
Q

What is the line that follows the general trend of the scatter plot called?

A

The fitted regression line, representing predicted values for any given value of X (proportion black). It is the best fit of the data → aids in making predictions from the data.

7
Q

What does the fitted regression line aid in?

A

It aids in making predictions from the data.

8
Q

What is the predicted value?

A

Values along the fitted regression line

9
Q

What is the Residual value e?

A
  • The difference (deviation) between the observed and predicted values.
  • Can have positive and negative residuals!
10
Q

What algorithm does a regression model use and what does it assure?

A

A regression model uses an algorithm called “ordinary least squares (OLS)” that ensures that the residual (deviation) values are as small as possible given the data. In other words, OLS makes the predicted values as close as possible (on average) to the observed values.
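The OLS idea can be sketched in Python. The data below are invented for illustration, and the closed-form estimates (slope = Sxy/Sxx, intercept = ȳ − slope·x̄) are the standard least-squares formulas:

```python
# Sketch of ordinary least squares (OLS); the data are invented for
# illustration. Closed-form estimates: b = Sxy / Sxx, a = ybar - b * xbar.

def ols_fit(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx        # slope
    a = ybar - b * xbar  # intercept
    return a, b

def sum_sq_residuals(x, y, a, b):
    # Sum of squared deviations between observed y and the line a + b*x.
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

a, b = ols_fit(x, y)
best = sum_sq_residuals(x, y, a, b)

# Any nearby line does worse: OLS makes the residuals as small as possible.
assert best < sum_sq_residuals(x, y, a, b + 0.1)
assert best < sum_sq_residuals(x, y, a - 0.1, b)
```

Perturbing either coefficient always increases the sum of squared residuals, which is exactly what “least squares” means.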

11
Q

The regression line through a scatter of points is described by the following equation:

A

Y = a + bX

1) 𝒀 is referred to as the response variable (also dependent variable).
2) 𝑿 is referred to as the explanatory variable.
3) a = intercept: the predicted value of Y when X is zero (unit is the same as in Y).
4) b = slope: the rate of change in Y as X changes.

12
Q

Why must you be careful when interpreting the intercept of a regression line?

A

A meaningful interpretation is only possible if X can truly be zero AND if the data include values close to zero (not the case here) → this is an issue.

13
Q

What is the unit of the intercept of the regression line?

A

The unit attached to the intercept is the same as the response variable (i.e., years).

14
Q

Why is the intercept useful for prediction?

A

Because it represents the addition (offset) required to correctly position the regression line so that predictions match the observed data.

–> Different intercepts would lead to predicted values that are either too high or too low.

15
Q

Define the slope

A

Because X is expressed as a proportion (i.e., 0 to 1), the slope is the increase in the response variable (age) when the predictor increases by 100%, i.e., when X goes from 0 to 1.

16
Q

What does Ŷ (y hat) stand for?

A

Predicted values on the regression line

17
Q

What are residuals 𝜀?

A

Residual values 𝜀 are the difference (deviation) between the observed and predicted values

18
Q

True or false: Each observation in the data has a predicted & residual value

A

TRUE

19
Q

What is the purpose of the OLS or ordinary least squares?

A

Trying to minimize the sum of the squares of the residuals

20
Q

What is the aim of a regression line?

A

A regression model aims at predicting the average Y based on X, i.e., predicting the average age of male lions based on their proportion of black spots.

21
Q

What does the line of best fit for a regression line minimize?

A

The line of best fit minimizes the average distance between data and fitted line, i.e., the residuals.

22
Q

How do we find the best line of a regression?

A

To find the best line, we must minimise the sum of the squares of the residuals

23
Q

Why does the line-of-best-fit method minimise the sum of the squares of the residuals, using squares rather than square roots?

A

Squares are used so that the quantity becomes a variance equation and the F-distribution can be used.

24
Q

What is the H0 and HA of a t-test in statistical hypothesis testing of a regression model?

A

1) H0: the statistical population slope 𝛽 = 0 (i.e., Y can’t be predicted by X).
2) HA: the population slope 𝛽 ≠ 0 (i.e., Y can be predicted by X).

25
Q

When is regression significant?

A

Regression is significant when the slope is different from zero.

26
Q

As with any estimate based on sample data, slopes can differ from zero even when the true population slope is zero, simply due to _____

A

sampling variation.

27
Q

How is regression tested using something similar to a one-sample t-test?

A

The regression slope b divided by its standard error can be used to test the null hypothesis that 𝛽 = 0. This is similar to the one-sample t-test:

--> Dividing the estimate by its standard error gives a statistic that is t-distributed.
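This slope t-test can be sketched in Python. The data are hypothetical, and the constant 2.447 is assumed to be the two-sided 5% critical value of the t-distribution with n − 2 = 6 degrees of freedom:

```python
import math

# Hypothetical data (not from the lecture): does X predict Y?
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

# Residual standard error (df = n - 2), then the standard error of the slope.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t = b / se_b          # analogous to a one-sample t-test statistic
t_crit = 2.447        # assumed two-sided 5% critical value, df = 6
significant = abs(t) > t_crit  # if True, reject H0: beta = 0
```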
28
Q

If P < 0.05 for a regression, what does this mean?

A

Reject H0 and conclude that the regression model can predict Y values.

29
Q

What occurs when there's an increase in residuals?

A

Larger residuals mean more variation, leading to a greater standard error and a smaller t-value, so there is less statistical power to reject the null hypothesis.

30
Q

What does statistical testing of regression depend on?

A

Both the slope and the residual variation.
31
Q

What is measured using the quantity called the “coefficient of determination” or the “famous” R^2?

A

The fraction of variation in Y (age) that is “explained” by X in the estimated linear regression model.

32
Q

What is the formula for R^2?

A

R^2 = regression sum-of-squares / total sum-of-squares

33
Q

What is the total sum-of-squares in R^2?

A

The maximum amount of variation in Y that could be explained by any linear regression model, i.e., the total sum-of-squares of Y.

34
Q

What is the regression sum-of-squares in R^2 and what is it a measure of?

A

The amount of variation in Y (age) that the regression model with X (proportion of black spots) as a predictor explains: the deviation of the predicted values from the mean of the observed values → a measure of quality.

35
Q

What does it mean when R^2 = 0.6238, for example?

A

We state that the regression model explains 62.38% of the total variation in Y. Ex: 62% of the variation in the age of lions can be predicted using the proportion of black spots on the noses of male lions.
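The R^2 decomposition can be sketched in Python with invented data (not the lion example):

```python
# Sketch of the coefficient of determination R^2 (invented data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.3, 9.7]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]  # predicted values on the fitted line

ss_total = sum((yi - ybar) ** 2 for yi in y)    # total variation in Y
ss_reg = sum((yh - ybar) ** 2 for yh in yhat)   # variation the model explains
r_squared = ss_reg / ss_total

# An r_squared of, say, 0.62 would mean the model explains 62% of the
# total variation in Y.
```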
36
Q

Does regression of Y on X always imply dependency?

A

NO: SPURIOUS CORRELATION
--> correlation between 2 variables having no causal relation.

37
Q

Define SPURIOUS CORRELATION

A

Correlation between 2 variables having no causal relation.
38
Q

What are confidence bands for regression lines?

A

Confidence bands describe the uncertainty around the mean relationship between X and Y.

39
Q

What does it mean when we are 95% confident that the slope will fit within this interval for a regression line?

A

Because individual data points include residual variation, many observed values will naturally fall outside the confidence bands. We are interested in the confidence of the slope, not of individual observations.

40
Q

What is the prediction interval for a regression line?

A

Prediction bands describe the uncertainty around individual observations, so they are always wider because they include both uncertainty in the mean and residual variation.

41
Q

Confidence versus prediction intervals:

A

1) Confidence bands describe the uncertainty around the mean relationship between X and Y.
2) Prediction bands describe the uncertainty around individual observations, so they are always wider because they include both uncertainty in the mean and residual variation.
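The two interval types can be sketched in Python using the standard simple-regression formulas. The data are invented, and 2.571 is assumed to be the two-sided 95% t critical value with n − 2 = 5 degrees of freedom:

```python
import math

# Sketch: confidence vs prediction interval half-widths at one X value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.8, 4.2, 5.9, 8.4, 9.6, 12.1, 13.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
# Residual standard error with n - 2 degrees of freedom.
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 4.0
leverage = 1 / n + (x0 - xbar) ** 2 / sxx
t_crit = 2.571                                    # assumed t_{0.975}, df = 5
half_conf = t_crit * s * math.sqrt(leverage)      # uncertainty in the mean
half_pred = t_crit * s * math.sqrt(1 + leverage)  # mean + residual variation

# Prediction bands include residual variation, so they are always wider.
assert half_pred > half_conf
```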
42
Q

What are the issues involving extrapolation in regression lines (predicting Y for X-values beyond the range of the data)?

A

- Ear length = 55.9 + 0.22(age). Our ears grow longer, about 0.22 mm per year.
- The intercept here predicts ear length at birth (X = 0 years); a baby does not have ears of 56 mm (i.e., 5.6 cm)!
- The slope changes over life, so at birth the prediction does not hold.
--> Predictions hold well within the range of X values but not outside.
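The ear-length example can be sketched directly; the equation is the one given in the card, and the point is that evaluating it at X = 0 extrapolates outside the sampled data:

```python
# The lecture's ear-length equation: ear length (mm) = 55.9 + 0.22 * age (years).
def ear_length_mm(age_years):
    return 55.9 + 0.22 * age_years

# Within the sampled (adult) age range the prediction is plausible:
adult = ear_length_mm(40.0)
# Extrapolating to X = 0 predicts 55.9 mm (5.6 cm) ears at birth — absurd:
newborn = ear_length_mm(0.0)
```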
43
Q

Define extrapolation

A

Extrapolation: the action of estimating or concluding something by assuming that existing trends will continue or a current method will remain applicable.
44
Q

What must you ensure about the predictor values of a regression line?

A

Ensure that the distribution of predictor values is approximately uniform within the sampled range; the standard error cannot tell you that.
- A non-uniform distribution of the predictor variable indicates biased sampling, which can lead to a Type I error.
45
Q

What are the 6 assumptions of regression models?

A

1) Linearity; it is critical to graph the data
2) All observations have similar influences on the regression model
3) Residual variation is normally distributed; that is where the confidence interval comes from
4) Residual variation is homoscedastic (constant across the range of X values)
5) Values of X (predictor) are measured without error (hard to assess, often assumed)
6) Residuals are independent: this is the assumption that data are sampled randomly
46
Q

What is the linearity assumption of the regression model?

A

- Plotting the residuals against predictor values is critical in assessing whether a linear model is appropriate.
- The horizontal line is the average of the residuals (which is always zero as a result of the fitting method).
- If variance differs in different parts of the line, this indicates lack of linearity or heteroscedasticity.
--> Dots should be mostly clustered around the line.
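A small Python sketch (invented data) of why the horizontal reference line in a residual plot sits at zero:

```python
# Residuals from an OLS fit always average to zero (invented data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.5, 4.9, 6.3]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n
# mean_residual is zero up to floating-point error; this is the horizontal
# line a residual-vs-predictor plot is drawn around.
```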
47
Q

What is the assumption that all observations must have similar influences on the regression model?

A

Francis Anscombe's quartet: Quartet 1 is the only appropriate one in the sense that all observations have the same influence on the model, i.e., removal of one observation won't affect the model much. There are different methods to estimate the influence of each observation on the model (advanced level).

48
Q

What is Francis Anscombe's quartet?

A

It comprises four data sets that have nearly identical simple descriptive statistics and regression models. Yet they have very different distributions and appear very different when graphed. These data demonstrate both the importance of graphing data before analyzing it and the impact of influential observations (outliers).
--> All quartets have the same regression model and R^2.
49
Q

Which quartet is appropriate for a regression model?

A

Quartet 1: data are scattered but follow a consistent trend.

50
Q

What is the assumption that residual variation is normally distributed (that is where the confidence interval comes from) for the regression model?

A

- Normality assumption: at each value of X, there is a normally distributed population of Y-values with its mean on the true regression line.
- One can estimate the model even if residuals are not normally distributed, but one cannot generalize the model to predict other observations in the statistical population or make inferences (e.g., p-values, confidence intervals, t-tests, ANOVAs).
51
Q

What is the assumption that residual variation is homoscedastic (constant across the range of X values) in regression models?

A

- Homoscedasticity assumption: at each value of X, there is a normally distributed population of Y-values with its mean on the true regression line. The variance of the Y-values is assumed to be the same for every value of X.
- One can't generalize the model to predict other observations if this assumption isn't met.

52
Q

What is the assumption that values of X (predictor) are measured without error (hard to assess, often assumed) for a regression model?

A

When values of X are measured with error, the slope changes and no longer properly predicts the values of Y.
--> ERROR IN X REDUCES SLOPES.

53
Q

What is an approach to the problem of measuring X values with error in regression models?

A

One approach to this problem is the so-called Type II regression model:
1) Vertical residuals: Type I regression (error in Y but not in X)
2) Perpendicular residuals: Type II regression (error in both Y and X)
54
Q

When performing a boxplot of both Type I and Type II regression, what can be seen?

A

1) Type I: values more clustered around the mean, with smaller whiskers indicating low variability in the data
2) Type II: wider boxplot, so values are less clustered around the mean, with longer whiskers indicating more variability in the data

55
Q

Type II regression is not biased but has a greater standard error (sampling variation): why?

A

Because both X and Y have errors.

56
Q

What must one be careful of when residuals are non-independent in regression models?

A

When residuals are non-independent, one should be careful about making inferences (e.g., p-values, confidence intervals, t-tests, ANOVAs).

57
Q

What are 2 tests that can be used to test whether the regression slope differs from zero?

A

1) Using a t-test
2) Using ANOVA (same H0 and HA)