Multiple Regression Flashcards

(84 cards)

1
Q

What is the definition of multiple regression

A

Used to determine the effect of two or more independent variables on a single dependent variable.

2
Q

What is the definition of a partial slope coefficient

A

The partial slope coefficient measures the change in the dependent variable for a one unit change in a specific independent variable, holding the other independent variables constant.

3
Q

What is the coefficient of determination

A

R squared

4
Q

What is the definition of r squared

A

The percentage of the total variation in the dependent variable which is explained by the independent variables.

5
Q

What is the difference conceptually between r squared and adjusted r squared

A

R squared does not take into account the number of variables added to the model, so it rises as variables are added. Adjusted R squared is scaled up or down based on the number of variables, so mindlessly adding variables can actually make the adjusted R squared fall.

6
Q

What are the underlying assumptions of multiple regression

A

Assumptions are:
Linearity - the relationship between the dependent and independent variables is linear
Non-random independent variables - the independent variables are not random
Expected error is zero - the expected value of the error, conditioned on the independent variables, is zero
Homoskedasticity - the variance of the error term is constant for all observations
No serial correlation - the error terms are not correlated with one another
Normality - the error terms are normally distributed

7
Q

What are some of the things that we can use multiple regressions to do

A

Identify relationships between variables
Forecast variables
Test existing theories.

8
Q

What is the definition of the error term

A

It is the difference between the observed value and the value predicted by the regression model

9
Q

What is the p value used for

A

To evaluate the null hypothesis that the slope coefficient is equal to zero

10
Q

How do you interpret the p value when it is greater or less than your significance level, and what does this imply

A

When the p value is less than your significance level, you can reject the null hypothesis, meaning that the slope coefficient is NOT equal to zero.
When the p value is more than your significance level, you fail to reject the null hypothesis, meaning that the slope coefficient could be equal to zero.

11
Q

Can you interpret what would happen to the y variable of the following equation: if we were to increase x1 by 1 unit holding x2 constant

Y = 1 + 2.5X1 + 6X2

A

When you do this, you would expect Y to increase by 2.5 units, assuming that you held X2 constant
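A quick numeric sketch of this card, plugging made-up values into Y = 1 + 2.5X1 + 6X2:

```python
# Evaluate Y = 1 + 2.5*X1 + 6*X2 and show the effect of a one unit
# increase in X1 while X2 is held constant (illustrative values only).
def predict_y(x1, x2):
    return 1 + 2.5 * x1 + 6 * x2

change = predict_y(5, 10) - predict_y(4, 10)
print(change)  # 2.5, the partial slope coefficient on X1
```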

12
Q

What is the purpose of a residual plot

A

A residual plot allows you to get a preliminary indication of assumption violations before performing statistical tests.

13
Q

What are the key things that you are looking for regarding a residual plot

A

Linearity
Homoskedasticity
Normal distribution

14
Q

What is the Q-Q plot of residuals and how do you interpret it

A

The Q-Q plot of residuals is a tool where you compare the residuals to a normal distribution.
If the residuals are normally distributed then the points should fall along the diagonal line of the Q-Q plot
If there is a smile or a frown shape on the Q-Q plot, then you can interpret the residuals as not being normally distributed.

15
Q

What percentage of the observations should fall beyond 1.65 standard deviations on a Q-Q plot

A

In a normal distribution only 5% of the observations should fall beyond 1.65 standard deviations.
If there are lots of observations beyond 2 standard deviations, then you have grounds to suggest that the distribution has fat tails

16
Q

What are the columns of an ANOVA table

A

Degrees of freedom
Sum of squares
Mean sum of squares

17
Q

What are the rows of an anova table

A

Regression
Residual
Total

18
Q

What is the regression sum of squares, and how many degrees of freedom does it have

A

The regression sum of squares is the variation which can be explained by the regression model. There are k degrees of freedom, where k is the number of independent variables

19
Q

What is the sum squared error

A

The sum of squared errors is the variation not explained by the model.

20
Q

What is the SST

A

SST is the total sum of squares which is the total variation in the dependent variable relative to the mean.
Equals the RSS + SSE

21
Q

What is the definition of R squared and how do you calculate it

A

R squared is defined as the variation that you can explain through the model divided by the total variation: R squared = RSS / SST. It basically answers the question: given the total variation, how much of that movement were we able to predict with the model we came up with?

22
Q

What is the alternative equation for the R squared if you have the sum squared errors and the total sum squares

A

The alternative equation is 1 - SSE/SST
This is basically one minus the sum of squared errors divided by the sum of squares total.
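A minimal check that the two R squared formulas agree, using made-up sums of squares:

```python
# R squared two ways: RSS/SST and 1 - SSE/SST. They must match because
# SST = RSS + SSE. The numbers are invented for illustration.
RSS, SSE = 80.0, 20.0      # explained and unexplained variation
SST = RSS + SSE            # total variation
r2_from_rss = RSS / SST
r2_from_sse = 1 - SSE / SST
print(r2_from_rss, r2_from_sse)  # both 0.8
```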

23
Q

What is one of the main limitations of the R squared

A

It will always increase as you add more variables to the model, regardless of how relevant those variables are, which means that you can end up overfitting the model if you are not careful.

24
Q

What is the purpose of the adjusted R squared value

A

Adjusted r squared takes into account the impact of overfitting: it penalises the addition of unnecessary variables.

25
Q

What is the adjusted R squared formula

A

Adjusted R squared = 1 - [(n - 1)/(n - k - 1)] * (1 - R squared)

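The adjusted R squared formula, sketched with invented inputs:

```python
# Adjusted R squared = 1 - [(n - 1)/(n - k - 1)] * (1 - R^2).
# n, k and r2 below are hypothetical.
n, k, r2 = 50, 4, 0.60
adj_r2 = 1 - (n - 1) / (n - k - 1) * (1 - r2)
print(round(adj_r2, 4))  # 0.5644, slightly below the raw R squared of 0.60
```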
26
Q

What is the rule of the adjusted r squared

A

Adjusted r squared is always less than or equal to r squared

27
Q

What is the definition of the standard error of estimate

A

The standard error of estimate measures the standard deviation of the residuals, and it indicates how well the model captures the relationship. It is calculated as the square root of the mean squared error: SEE = MSE^(1/2)

28
Q

What is the definition of the mean squared error and what is the calculation

A

The MSE is equal to the sum of squared errors divided by the number of degrees of freedom associated with the sum of squared errors. Therefore MSE = SSE / (n - k - 1)

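MSE and SEE in one short sketch (SSE, n and k below are hypothetical):

```python
import math

# MSE = SSE / (n - k - 1); SEE = sqrt(MSE).
SSE, n, k = 90.0, 50, 4
MSE = SSE / (n - k - 1)   # 90 / 45
SEE = math.sqrt(MSE)
print(MSE, SEE)  # 2.0 and roughly 1.414
```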
29
Q

What is the Akaike information criterion (AIC) and when should it be used

A

The AIC is used when the goal is to forecast. The AIC formula is n * ln(SSE/n) + 2(k + 1). Lower values are better for the AIC because they indicate that the model fits the outcomes better.

30
Q

What is Schwarz's Bayesian information criterion, when should you use it, and what is the formula

A

You should use it if your goal is a better goodness of fit. The BIC formula is n * ln(SSE/n) + ln(n) * (k + 1). Both criteria include a penalty for adding more variables, however the BIC imposes a higher penalty for overfitting than the AIC does.

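A side-by-side sketch of the AIC and BIC formulas with invented n, k and SSE; the point is that BIC's per-parameter penalty ln(n) exceeds AIC's 2 for any reasonably sized sample:

```python
import math

# AIC = n*ln(SSE/n) + 2*(k + 1)
# BIC = n*ln(SSE/n) + ln(n)*(k + 1)
n, k, SSE = 100, 3, 250.0
aic = n * math.log(SSE / n) + 2 * (k + 1)
bic = n * math.log(SSE / n) + math.log(n) * (k + 1)
print(round(aic, 2), round(bic, 2))  # lower is better; BIC > AIC since ln(100) > 2
```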
31
Q

Which test should you use to evaluate the significance of individual coefficients

A

A t test

32
Q

What test should you use to evaluate more than one hypothesis

A

An F test

33
Q

Why would you not just use t tests to evaluate if a nested model is more effective than a full model?

A

Because you would have to use individual t tests on each new variable, and when the independent variables are correlated with one another those individual tests are unreliable. The F test allows you to see if the variables are valuable collectively.

34
Q

What does formulating a hypothesis on the significance of two or more coefficients actually mean as it relates to joint hypothesis tests

A

A joint hypothesis test is used to determine if a group of variables collectively contributes to the explanatory power of the model. Basically you are testing whether a model with more variables added is better or worse than a model with fewer variables

35
Q

How do you formulate the hypothesis when you are testing if adding coefficients is a good idea

A

In a joint test, the null hypothesis is that all the tested slope coefficients are equal to zero. The alternative hypothesis is that at least one of those slope coefficients is not equal to zero.

36
Q

How do you perform the test to check if having more or fewer variables is good for a regression model

A

To perform the test you compare two versions of the model:
The restricted model, a simpler version where you have removed the independent variables that you are interested in testing
The unrestricted model, which includes all the independent variables.

37
Q

What is the equation for the F statistic for joint tests and what is the logic behind it

A

The F statistic for joint tests equals (SSE restricted - SSE unrestricted)/q, all divided by SSE unrestricted / (n - k - 1). Basically: how much more of the error are you able to explain by adding in the additional variables?

38
Q

What is the definition of q in the F test to see if the model is better or worse with more variables

A

q is the number of extra variables that you are testing, so the difference between the number of variables in the unrestricted and the restricted models.

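The joint F statistic from card 37, with made-up restricted and unrestricted SSEs:

```python
# F = [(SSE_restricted - SSE_unrestricted)/q] / [SSE_unrestricted/(n - k - 1)]
# q = number of excluded variables; k = regressors in the unrestricted model.
SSE_r, SSE_u = 120.0, 100.0
n, k, q = 60, 5, 2
F = ((SSE_r - SSE_u) / q) / (SSE_u / (n - k - 1))
print(round(F, 2))  # 5.4: compare against the critical F with q and n-k-1 df
```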
39
Q

How do you test the overall significance of the entire model, and what is the F statistic in this case

A

F = MSR / MSE
MSR is the mean squared regression
MSE is the mean squared error

40
Q

What is the decision rule for the test to see if the extra variables are relevant or not

A

It is a one-tailed test because you are testing whether all of the coefficients are equal to zero at the same time. You reject the null hypothesis if the calculated F stat is greater than the critical F value. If you reject the null hypothesis then you know that the excluded variables provide a lot of explanatory power; if you don't, then you know that those variables were basically useless

41
Q

Would you include variables when trying to estimate a value using a model when some of the variables are not statistically significant, and where do you have to be careful in the exam

A

Yes, you have to include all of the variables, because if you were to leave out some of the variables in a model you had been given, you would end up miscalculating the value. The variables interact with one another, so if you took one out you would likely have to re-estimate the whole model.
Units of measurement: be careful that you put the correct units of measurement into the calculation.

42
Q

What is the definition of model specification

A

Model specification is when you select the explanatory variables to include in the regression model

43
Q

What are the principles of proper model specification

A

Economic rationale - the variables should have a sound economic justification
Parsimony - the model should be simple and efficient
Appropriate functional form - the relationship between the variables should be correctly identified
No assumption violations - the model must not violate the core regression assumptions
Out-of-sample performance - the model should demonstrate strong predictive power when applied to data not used in the initial estimation

44
Q

What are some of the main failures with model specification

A

Omitting variables
Inappropriate variable form
Inappropriate scaling
Incorrect pooling of data
Time series misspecification

45
Q

What are the effects of omitting important variables

A

If the omitted variable is correlated with the other independent variables, the estimated coefficients will be BIASED
If the omitted variable is uncorrelated with the other regressors, the slope might be correct but the intercept incorrect

46
Q

What are some examples of time series misspecification

A

Including lagged dependent variables - basically when you include a variable from the wrong time period
Including variables that are measured with significant error.

47
Q

What does incorrectly pooling data mean

A

When you regress a relationship over a full time period but the relationship between the variables actually changes over sub-periods. This might mean that you have to split the data out into different periods of time, like 3-year windows.

48
Q

What is the definition of heteroskedasticity

A

When the variance of the error term is not constant across all observations

49
Q

What are the two types of heteroskedasticity

A

Unconditional and conditional heteroskedasticity

50
Q

What is the definition of unconditional heteroskedasticity

A

Occurs when the error variance is not related to the values of the independent variables. Does NOT create major issues for statistical inference

51
Q

What is the definition of conditional heteroskedasticity

A

Conditional heteroskedasticity is the type where the variance of the error term is correlated with the values of the independent variables. This is a big problem for statistical inference

52
Q

What are the effects of conditional heteroskedasticity

A

Biased standard errors, which are usually underestimated
Inflated t statistics
Unreliable F tests
You are also likely to run into Type I errors, where you incorrectly reject the null hypothesis.

53
Q

What are some of the methods that you could use to detect heteroskedasticity

A

Visual inspection, by plotting the error terms versus the predicted values
The Breusch-Pagan (BP) test, which is the formal statistical test

54
Q

What is the Breusch-Pagan test and what is it used for

A

The Breusch-Pagan test involves regressing the squared residuals from the original model on the independent variables. The test statistic is calculated as n * R squared. The statistic follows a chi-squared distribution with k degrees of freedom

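The Breusch-Pagan statistic is just n times the auxiliary R squared; the numbers below are hypothetical:

```python
# BP statistic = n * R^2 from regressing the squared residuals on the
# independent variables; compare with the chi-squared(k) critical value.
n, k = 100, 3
r2_aux = 0.08              # hypothetical auxiliary-regression R squared
bp_stat = n * r2_aux
print(round(bp_stat, 2))   # 8.0; chi-squared with k = 3 degrees of freedom
```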
55
Q

What are the ways that you can correct for heteroskedasticity

A

Robust standard errors. Also known as the White-corrected or Hansen method, where you adjust the standard errors upward to account for the heteroskedasticity, which leads to more accurate t stats
Generalised least squares. This is when you modify the original regression equation to eliminate the heteroskedasticity

56
Q

What is the definition of autocorrelation, what is it also called, and when is it most commonly encountered

A

It's also called serial correlation
Serial correlation is when the error terms are correlated with one another, so the error in one period is related to the error in another period
Most common in time series data

57
Q

What are the two types of serial correlation

A

Positive serial correlation - when a positive error for one period increases the likelihood of another positive error in another period. If stock prices are up today they are likely to be up tomorrow
Negative serial correlation - when a positive error for one observation increases the likelihood of a negative error for another observation.

58
Q

What are the main problems with serial correlation

A

Coefficient consistency - serial correlation actually has no impact whatsoever on the regression coefficients themselves
Biased standard errors - standard errors for the regression coefficients are usually underestimated
High t stats - because the standard errors you estimate are too small, the calculated t statistic can become artificially large; the F stat is also inflated

59
Q

What are the detection methods to check for serial correlation of the error terms

A

Visual inspection
The Durbin-Watson test. The test statistic of the Durbin-Watson test is equal to 2(1 - r), where r is the sample correlation between the residuals
The Breusch-Godfrey test. A more robust method that can detect serial correlation beyond the first lag

60
Q

What is the interpretation of the Durbin-Watson test, and what is its constraint

A

A value of 2 means that there is no serial correlation
Values between 0 and 2 indicate a positive correlation
Values between 2 and 4 indicate a negative correlation
The Durbin-Watson test does not work on autoregressive models

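The Durbin-Watson approximation from these cards as a one-liner; the r values are illustrative:

```python
# DW ~ 2*(1 - r), where r is the first-order correlation of the residuals.
# Near 2: no serial correlation; below 2: positive; above 2: negative.
def dw_approx(r):
    return 2 * (1 - r)

print(dw_approx(0.0))   # 2.0 -> no serial correlation
print(dw_approx(0.5))   # 1.0 -> positive serial correlation
print(dw_approx(-0.5))  # 3.0 -> negative serial correlation
```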
61
Q

How do you correct for autocorrelation

A

Hansen method (Newey-West estimator) - this calculates robust standard errors that adjust for serial correlation and heteroskedasticity simultaneously
Modify the regression - one common fix is to add a lagged variable to account for the correlation
Other methods, such as using instrumental variables or panel data methods

62
Q

What is multicollinearity

A

It is a violation of the assumption that there is no exact linear relationship between two or more of the independent variables. Basically, two or more of the independent variables are highly correlated with one another

63
Q

What are the impacts of multicollinearity on statistical inference

A

Inflated standard errors - the standard errors get larger
Depressed t stats - because the standard errors are inflated, the t stat for each coefficient becomes small
Unreliable coefficients - the estimates of the regression coefficients become imprecise and unreliable
Unstable estimates - small changes in the data can cause the estimated coefficients to change significantly.

64
Q

What are the detection methods used for multicollinearity

A

The classic symptom - the model has a high r squared and a significant F stat, but the individual t tests are not significant. This basically means that the model on the whole is good at estimating the dependent variable but you cannot see which of the variables is actually doing most of the estimation.
Pairwise correlations - high pairwise correlations between the independent variables are a warning sign. However, you can still have multicollinearity even if the pairwise correlations are low
Variance inflation factor (VIF) - the most formal quantitative measure.
VIF = 1 means that there is no correlation between the variable and the other regressors
VIF > 5 needs further investigation
VIF > 10 means you have serious multicollinearity

65
Q

What are the main ways that you correct for multicollinearity

A

Most effective is to omit variables - drop one or more of the highly correlated variables
Use a different proxy - replace one of the correlated variables with another variable that explains the same economic reality but is not correlated with the other variables
Increase the sample size - increasing the number of observations can help you estimate the model more accurately.

66
Q

What effect does multicollinearity have on the overall r squared of the model or the overall F test

A

It actually doesn't have any effect at all on the overall r squared and also doesn't have any impact on the overall F test.

67
Q

What is the VIF equation

A

VIF = 1/(1 - R^2), where R^2 comes from regressing that independent variable on the other regressors

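The VIF formula with the rule-of-thumb thresholds from card 64; the R squared inputs are hypothetical:

```python
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j
# on the other independent variables.
def vif(r2_j):
    return 1 / (1 - r2_j)

print(vif(0.0))     # 1.0 -> no correlation with the other regressors
print(vif(0.75))    # 4.0 -> below the "investigate" threshold of 5
print(vif(0.9375))  # 16.0 -> past the "serious" threshold of 10
```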
68
Q

What is influence analysis

A

Influence analysis is when you identify the specific observations that have the biggest impact on the regression model

69
Q

What are the three extreme data point types

A

Outliers - extreme values of Y
High leverage points - extreme values of X (the independent variables)
Influential data points - extreme observations that, when excluded, cause significant changes to the model coefficients.

70
Q

What is the leverage method of detection as it relates to influence analysis

A

Leverage measures the distance between an observation of an independent variable and the sample mean. Leverage can range between 0 and 1.
An observation is influential if its leverage is greater than three times the average leverage. The threshold formula is 3(k + 1)/n, where k is the number of independent variables and n is the number of observations.

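The leverage cutoff from the card, as a tiny helper (k and n are made up):

```python
# Flag an observation when leverage > 3*(k + 1)/n, i.e. three times the
# average leverage (average leverage is (k + 1)/n).
def leverage_threshold(k, n):
    return 3 * (k + 1) / n

print(leverage_threshold(4, 100))  # 0.15 with 4 regressors and 100 observations
```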
71
Q

What is a studentized residual

A

A studentized residual specifically identifies outliers in the dependent variable. You re-estimate the model with one observation deleted at a time, then you compare the observed y value to the y value predicted by the model fitted without that observation. The difference between these two values is divided by its standard deviation.
You then compare the absolute value of the studentized residual to a critical value of the t distribution with n - k - 2 degrees of freedom. If the residual is bigger than the critical value, then the point is an outlier

72
Q

What is Cook's D and what is it used for

A

Cook's D is a composite metric that identifies influential observations by considering both the x and y values. Thresholds can vary, but sometimes an observation is considered influential if Cook's D exceeds (k/n)^(1/2)

73
Q

What are some methods you can use once you have identified the influential data points

A

Winsorisation
Harmonisation
Checking for input errors
Checking for omitted variables

74
Q

What are dummy variables

A

They are variables that take a value of 0 or 1 depending on whether a condition is true or false.

75
Q

What is the n-1 rule in dummy variables

A

In order to distinguish between n categories you need to use n - 1 dummy variables. If you use n instead of n - 1 then you violate the assumption that there is no exact linear relationship between the variables.

76
Q

How do you interpret the coefficients of dummy variables

A

The intercept is the average value of the dependent variable for the omitted category (the control)
The dummy coefficient indicates the estimated difference in the dependent variable for that category relative to the average value of the reference category

77
Q

What are the types of dummy variables

A

Intercept dummies - these shift the intercept of the regression line up or down; the slope remains the same
Slope dummies - these change the slope of the regression line for a specific category. The interaction term for the slope dummy is found by multiplying the dummy variable by a continuous independent variable. It captures how the relationship between x and y changes based on a given factor

78
Q

How are dummy variables used to predict values

A

To predict a value based on the dummy variable model, you just put 1 or 0 into each dummy variable. If the category is the reference group you SET ALL DUMMY VARIABLES TO 0 and the predicted value is based only on the intercept. If the category is a dummy group, you set that variable to 1 and the rest to 0, which gives you an estimate for that category.

79
Q

Why would you use a logistic regression

A

Logistic regression is used when the dependent variable is qualitative, for example a binary outcome, which means that the predicted value falls between zero and one. The logistic regression transforms a probability value into log odds

80
Q

What is the logit transformation

A

It's the natural log of the odds: ln(p/(1 - p))

81
Q

What are the assumptions and estimation method of the logit regression model

A

Assumes that the residuals follow a logistic distribution. This is similar to a normal distribution but has fatter tails.
Unlike OLS, which minimises the sum of squared errors, logit coefficients are estimated using MLE (maximum likelihood estimation). MLE seeks the values that maximise the likelihood of observing the actual data.

82
Q

How do you interpret the coefficients of the logit regression model

A

Because the model is non-linear, the intercept represents the log odds when all independent variables are equal to zero.
For the slope coefficients, because the function is curved, the change in probability for a one unit change in an independent variable is NON-CONSTANT.
To estimate the impact of a variable, you use the average value for each of the independent variables and find the predicted value. Then you increase one of the variables by one and look at the new predicted value. The difference between these two values is the impact of that variable.

83
Q

How do you calculate the predicted probability using a logit model

A

1. Calculate y by plugging the assumed x values into the equation
2. Calculate the odds = e^y
3. Calculate the probability p = odds / (1 + odds)

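The three steps above in code, with purely hypothetical coefficients:

```python
import math

# Step 1: y = b0 + b1*x gives the log odds (coefficients are invented).
# Step 2: odds = e^y.  Step 3: p = odds / (1 + odds).
b0, b1 = -1.0, 0.8
x = 2.0
y = b0 + b1 * x
odds = math.exp(y)
p = odds / (1 + odds)
print(round(p, 4))  # roughly 0.65 for these made-up inputs
```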
84
Q

How do you calculate the model fit for a logit model

A

You cannot use the r squared.
You use the likelihood ratio (LR) test. This test is also used on nested models, and it follows a chi-squared distribution with q degrees of freedom.
You can also use the log-likelihood, which is always negative; higher values mean a better fit.
Or you can use the pseudo r squared, which has values reported by software that can be used to compare competing models for the same variable.