Why would we include more than one variable in a model?
When conducting regression with binary data what can be used?
Logistic regression
What does logistic regression involve?
Modelling the log odds (logit) of the probability of outcome
Why is the logit link used instead of the log odds?
The logit link is used so we can model probability on a linear scale;probability is bounded between 0 and 1
The log odds takes any number from -infinity to +infinity. It is unbounded
What do we measure the log odds with in a logistic regression?
A linear predictor
To interpret coefficients how must we exponentiate them?
exp(Bo) = eBo= odds of Y = 1 when x1= 0
if x1 is binary
exp(B1) = eB1= odds ratio comparing x1 = 1 when x1= 0
if x1 is continuous
exp(B1) = eB1= odds ratio comparing x1 = x+1 to x1=x
i.e odds ratio for a one unit increase in x1
What else needs to be exponentiated?
Confidence intervals
Lower limits(OR) = ell
Upper limits(OR) = eul
Standard errors are given on the log scale and thus do not need to calculate CIs from standard errors on the odds ratio scale
An odds ratio of 1 means no effect - CI significant if excludes 1
p-values need no exponentiating
When doing logistic regression what is assumed?
The linear predictor can contain as many variables as we need
True or false
True
log(p(y=1)/1-P(Y=1) = B0, B1x1, B2x2, B3x3, B4x4…
Exponentiating any coefficient gives what?
An odds ratio that can be interpreted as with the single variable model
What are coefficients now conditional upon?
Other coefficients in the model
How must categorical variables be handled and what is this sometimes called?
By splitting into multiple binary variables
Sometimes called one hot endcoding
There will be one less variable than categories, the category with no variable is the reference category
* E.g. race (coded as White, Black, Hispanic, Other)
Coefficients for categorical variables are relative to what?
Reference category
How must categorical variables be coded to be used in regression and what can they have?
Numerically and they can have value labels
Putting what in front of a variable in stata tells it to treat it as categorical?
What happens if this is not included?
an i.
Stata will treat the variable as continuous
In stata what is the default category?
The one with the lowest number
- The default category can be changed by having b# in front of the i
- They can have value labels
Eg. to have the category coded as 4 as reference
You add more than one variable and will get more than one p-value reported
Correctly test whether there is an association between the categorical variable and the outcome a joint test must be carried out
Two types of test:
Wald test: Simpler to implement
Likelihood ratio test: Better statistical properties
What can a likelihood ratio
test be used to compare?
Nested models
What does likelihood refer to?
How likely the data is to be observed based on the parameter estimates.
What do Likelihood ratio tests compare?
The likelihood from the two models
Fit both models
Compare the likelihood
What do Wald tests use?
A quadratic approximation to the likelihood to calculate p-values based on the fitted model only
The p-values you see in the output from linear regression are what?
Wald tests
Joint tests (Likelihood ratio or Wald tests) must be used when..
Testing associations of categorical variables
Odds ratios from multiple logistic regression are conditional on what?
Other variables in the model