What does a t stat test? When would and wouldn’t a t stat be significant?
The t-statistic tests the null hypothesis that a particular regression coefficient (β) is zero, against the alternative that it is different from zero (i.e., that the variable has an effect).
* A large absolute t-value (e.g., > 2 or < -2) suggests the coefficient is significantly different from zero.
* A small t-value means the variable might not be contributing much to the model.
Signal-to-noise ratio: how strong the effect of the variable is relative to uncertainty
Note: t = estimate/standard error
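The t = estimate/standard error formula can be sketched in a couple of lines; the coefficient and standard error values here are made up for illustration:

```python
# Hypothetical regression output: a slope estimate and its standard error
estimate = 0.84    # estimated coefficient (beta-hat), made-up value
std_error = 0.21   # standard error of that estimate, made-up value

# t-statistic: signal (the estimate) divided by noise (its standard error)
t_stat = estimate / std_error
print(round(t_stat, 2))  # 4.0 -> well beyond the rough |t| > 2 cutoff
```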
What does a p value tell you? What does a high or low p value show and when would you reject the null?
The p-value tells you the probability of observing a t-statistic as extreme as the one calculated, assuming the null hypothesis is true (i.e., the coefficient is zero). Is the effect of the variable real or random noise?
Interpretation:
* Low p-value (< 0.05) → Reject the null hypothesis → The variable is statistically significant.
* High p-value (> 0.05) → Fail to reject the null → The variable is not statistically significant.
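The decision rule above can be sketched with `scipy.stats`; the t-statistic and degrees of freedom are made-up values:

```python
from scipy import stats

t_stat = 4.0   # hypothetical t-statistic
df = 26        # residual degrees of freedom (n - k - 1), made-up

# Two-sided p-value: probability of a |t| at least this extreme if H0 is true
p_value = 2 * stats.t.sf(abs(t_stat), df)

alpha = 0.05
reject_null = p_value < alpha   # low p-value -> reject the null
print(reject_null)  # True
```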
What are the assumptions of multiple regression?
* Linear relationship between x and y
* No exact linear relationship among x’s (violation = multicollinearity)
* Expected value of error term = 0
* Variance of error term is constant (violation = heteroskedasticity)
* Errors not serially correlated
* Errors normally distributed
What is a normal Q–Q plot?
Compares a variable’s distribution to a normal distribution. Helpful for exploring whether residuals are normally distributed (a key regression assumption).
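A minimal sketch of the Q–Q idea using `scipy.stats.probplot`, which computes the plot coordinates plus a straight-line fit of sample vs. theoretical quantiles (the residuals here are simulated stand-ins, not real model output):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=200)  # stand-in residuals; in practice use model residuals

# probplot returns the Q-Q coordinates and a least-squares fit of ordered
# sample values against theoretical normal quantiles; r near 1 suggests
# the residuals are approximately normal
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))
```

If the points (and hence the fit) hug a straight line, the normality assumption looks reasonable; systematic curvature suggests a violation.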
What are the columns included in an ANOVA table?
What do they measure/show?
What is an ANOVA table used to calculate?
Columns: source of variation (Regression, Error, Total), degrees of freedom (df), sum of squares (SS), mean square (MS = SS/df), and the F-statistic.
The ANOVA table is used to calculate the F-test and R^2.
What are the degrees of freedom for Regression, error and total in the ANOVA table?
Regression: k, Error: n − k − 1, Total: n − 1
Why do I lose k + 1 degrees of freedom in multiple regression?
In a multiple regression model with k predictors and an intercept, you estimate k + 1 parameters.
Each estimated parameter uses up one degree of freedom.
So, from your total sample size n, you lose k + 1 degrees of freedom.
Total degrees of freedom: n (number of observations)
Used for estimating parameters: k predictors + 1 intercept = k+1
Remaining degrees of freedom (for residuals):
n−k−1
→ These are used to assess the fit of the model, such as calculating the standard error of the regression and testing hypotheses.
So, if you have k predictors, you lose k + 1 data points’ worth of freedom.
It’s like having k + 1 fewer data points to test your idea; those were spent fitting the model.
The remaining n − k − 1 degrees of freedom are used to measure how well the model fits the data.
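The degrees-of-freedom bookkeeping above can be checked with a quick sketch (n and k are made-up values):

```python
n = 30   # hypothetical sample size
k = 3    # hypothetical number of predictors

params_estimated = k + 1       # k slopes + 1 intercept
regression_df = k              # df used by the predictors
residual_df = n - k - 1        # df left over for the residuals
total_df = n - 1               # total df (one lost estimating the mean of y)

# Regression df and residual df partition the total df
print(regression_df + residual_df == total_df)  # True: 3 + 26 == 29
```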
How do you calculate Regression Sum of Squares and what is its significance and usage?
What kind of variation does it reflect? What does high v low SSR mean?
= explained variation
Significance: SSR measures the variation explained by the regression model. It indicates how much of the total variation is accounted for by the model’s predictions.
**Usage:** A higher SSR suggests that the model is effective in explaining the variability in the data. It is used to assess the model’s explanatory power.
Y estimated v.s. Y mean
How do you calculate Error Sum of Squares and what is its significance and usage?
= unexplained variation
Significance: SSE measures the variation that is not explained by the regression model. It represents the residual or unexplained variability in the data.
Usage: A lower SSE indicates that the model’s predictions are closer to the actual data points, suggesting a better fit. It is used to evaluate the model’s accuracy.
Y actual v.s. Y estimated
How do you calculate Total Sum of Squares and what is its significance and usage?
= explained variation + unexplained variation
= SST = SSR + SSE
**Significance:** SST represents the total variation in the observed data. It serves as a baseline measure of how much the actual data points deviate from the overall mean (sum of squared differences).
**Usage:** SST is used to quantify the total variability in the dataset before any model is applied.
Y actual v.s. Y mean
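The three sums of squares and the decomposition SST = SSR + SSE can be verified on a tiny made-up dataset, fitting the line with plain numpy least squares:

```python
import numpy as np

# Tiny made-up dataset: y roughly linear in x with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)       # total:       actual vs. mean
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained:   estimated vs. mean
sse = np.sum((y - y_hat) ** 2)          # unexplained: actual vs. estimated

print(np.isclose(sst, ssr + sse))  # True: total = explained + unexplained
```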
How is the Mean Square calculated?
Sum of squares divided by degrees of freedom
What is the formula to calculate R2 directly from the ANOVA table and what does it show?
R2 measures the % of total variation in the Y variable (dependent) explained by the X variable (independent)
= explained variation / total variation = SSR/SST
or
= (total variation − unexplained variation) / total variation = 1 − SSE/SST
What does an R2 of 0.25 mean?
X explains 25% of the variation in Y
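A one-line sketch of R2 = SSR/SST with made-up sums of squares chosen to give the 0.25 in the flashcard:

```python
ssr = 25.0    # explained variation, made-up value
sst = 100.0   # total variation, made-up value

r_squared = ssr / sst   # equivalently 1 - sse/sst, since sse = sst - ssr
print(r_squared)  # 0.25 -> x explains 25% of the variation in y
```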
What does adjusted R2 do and how is it calculated? Why, and what’s the formula?
Adjusted R2 applies a penalty factor to reflect the quality of added variables.
Too many explanatory x variables run the risk of trying to overexplain the data (explaining randomness, not true patterns) = poor forecasting.
formula: adjusted R2 = 1 − (total df / unexplained df) × (1 − R2) = 1 − [(n − 1)/(n − k − 1)] × (1 − R2)
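The adjusted R2 formula as a quick sketch, with made-up n, k, and R2 values:

```python
n, k = 30, 3   # hypothetical sample size and predictor count
r2 = 0.25      # made-up R^2

# Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)
adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
print(round(adj_r2, 4))  # 0.1635 -> lower than R^2 of 0.25, as expected
```

Note that adjusted R2 is always at most R2, and the gap widens as k grows relative to n.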
Explain the adjusted r2 in words?
“Let’s take the unexplained variance and scale it based on how many predictors (k) you used and how much data you had. If you added predictors that don’t help, we’ll penalize you.”
What makes adjusted R2 more reliable/refined? i.e. explain how the penalty works
Why this makes it refined
If you add a predictor that doesn’t help, R2 barely increases, but k increases.
That makes the denominator n−k−1 smaller → the whole fraction gets bigger → Adjusted R2 drops.
So the formula is saying:
“You added complexity, but didn’t improve the model enough to justify it.”
What does a small or large right hand side of the adjusted r2 formula show about a model?
Small right-hand side → model explains a lot → Adjusted R2 is high.
Large right-hand side → model explains little or is overfitted → Adjusted R2 is low.
Is higher or lower better for the following:
* R2: higher is better (more variation explained)
* AIC: lower is better
* BIC: lower is better
What does AIC help to evaluate/when is it best used?
What is the formula? And what effect does k have?
AIC: goodness of fit if the purpose is prediction, i.e. the goal is a better-quality, more accurate forecast.
Formula: AIC = n × ln(SSE/n) + 2(k + 1)
If k increases, AIC increases (a penalty of 2 per extra parameter).
What does BIC help to evaluate/when is it best used?
What is the formula? And what effect does k have compared to AIC?
BIC is preferred if the simplest adequate fit is the goal. Formula: BIC = n × ln(SSE/n) + ln(n) × (k + 1). It imposes a higher penalty for overfitting: if k increases, BIC increases more than AIC (since ln(n) > 2 once n > 7).
BIC selects the simplest model that best explains the data, with a stronger penalty for complexity as your dataset grows.
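The difference in penalty between the two criteria can be sketched using the sum-of-squares forms above (n and SSE are made-up values):

```python
import math

n, sse = 100, 40.0   # hypothetical sample size and error sum of squares

def aic(n, sse, k):
    # AIC = n*ln(SSE/n) + 2*(k + 1): flat penalty of 2 per parameter
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(n, sse, k):
    # BIC = n*ln(SSE/n) + ln(n)*(k + 1): penalty grows with sample size
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Add one predictor (k: 3 -> 4) with SSE unchanged: both criteria rise,
# but BIC rises by ln(100) ~= 4.61 versus AIC's flat 2
print(round(aic(n, sse, 4) - aic(n, sse, 3), 2))  # 2.0
print(bic(n, sse, 4) - bic(n, sse, 3) > 2.0)      # True
```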
When do you use AIC v BIC?
Use AIC when:
You care more about predictive accuracy.
You’re okay with a slightly more complex model if it improves fit.
Use BIC when:
You want a parsimonious model (simpler is better).
You have a large dataset and want to avoid overfitting.
Imagine you’re choosing a team for a project:
AIC says: “Add people if they help—even a little.”
BIC says: “Only add someone if they really help, especially if the team is already big.”
What is the purpose of the F-statistic in nested/joint models?
To determine if the simpler (nested) model is significantly different from the more complex (full) model.
How do you calculate the F-statistic for nested models?
It’s basically (new − old)/old, scaled by degrees of freedom:
F = [(SSE_restricted − SSE_full) / q] / [SSE_full / (n − k − 1)]
where q is the number of restrictions (predictors dropped from the full model) and k is the number of predictors in the full model.
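A sketch of the nested-model F-statistic with made-up sums of squares and counts:

```python
# Hypothetical nested comparison: the restricted model drops q predictors
sse_restricted = 60.0    # SSE of the simpler (nested) model, made-up
sse_full = 40.0          # SSE of the full model, made-up
n, k_full, q = 50, 5, 2  # sample size, full-model predictors, restrictions

# "new - old over old": improvement per dropped predictor, relative to
# the full model's error per residual degree of freedom
f_stat = ((sse_restricted - sse_full) / q) / (sse_full / (n - k_full - 1))
print(round(f_stat, 2))  # 11.0
```

A large F means the extra predictors reduce SSE by more than chance would explain, so the full model is preferred over the nested one.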