What is survival analysis? What are the key statistics?
Survival analysis is analysis in which the outcome variable is the time until the occurrence of an event of interest.
Survival function: the probability of not experiencing the event of interest (i.e. of surviving) at least to time t. Illustrated on a survival curve
Hazard function: the conditional probability of experiencing the event of interest at time t, having survived to that time (i.e. the instantaneous event rate)
What are the key methods used in survival analysis? What are the key assumptions?
Kaplan-Meier
- Calculates the survival function each time an event occurs and visually shows survival on a curve. The curve steps down at each event; censored observations are usually shown as tick marks and do not cause a step
- Non-parametric
- Can assess a statistically significant difference between two groups using the log-rank test. But this cannot tell us the magnitude of the difference and cannot account for covariates.
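The Kaplan-Meier calculation above can be sketched in pure Python. At each distinct event time t_i, S(t) is multiplied by (1 - d_i/n_i), where d_i is the number of events and n_i the number still at risk. The times and event flags below are invented illustrative data (1 = event, 0 = censored), not from the text.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    Kaplan-Meier estimator: S(t) = product over event times t_i <= t
    of (1 - d_i / n_i), where d_i = events at t_i, n_i = number at risk.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0  # events at time t
        c = 0  # censorings at time t
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 1:
                d += 1
            else:
                c += 1
            i += 1
        if d > 0:
            # Convention: events at a tied time happen before censorings,
            # so n_at_risk still includes subjects censored at t.
            survival *= 1 - d / n_at_risk
            curve.append((t, survival))
        n_at_risk -= d + c
    return curve

# Hypothetical follow-up times (months) and event indicators
times = [2, 3, 3, 5, 6, 8, 9]
events = [1, 1, 0, 1, 0, 1, 1]
print(kaplan_meier(times, events))
```

The step-only-at-events behaviour falls out naturally: censorings shrink the risk set but leave the survival estimate unchanged.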
Cox proportional hazards regression
- Proportional hazards assumption: relative hazard remains constant over the time period (survival curves don’t cross over)
- Results give you the log hazard ratio, which can be exponentiated to give the HR
Other key assumptions:
- Censoring must be non-informative (probability of censoring must be unrelated to probability of having event)
- Effects of predictor variables on survival must be constant over time and multiplicatively related to the hazard (e.g. the relative hazard for men vs. women must be the same for young vs. old)
What is a censored observation and how do you deal with censoring issues?
A censored observation is one where the event in question (such as death or discharge from hospital) has not happened at the time of the analysis, and all we know is the length of time the subject has been in the study
- Right censoring: withdrawal from the study, loss to follow-up, or end of study without experiencing the outcome.
- Left censoring: event of interest occurs before the start of observation period.
- Interval censoring: Know the interval that the event of interest occurred in but not exact date
Dealing with censoring issues
- Can impute data
- Could remove censored data
- Sensitivity analysis with best and worst case scenarios for missing data
T- tests
- When are they used and what are the different types?
- What are the assumptions of a t-test?
- What is the non-parametric alternative?
- Used for comparing two means.
- t-statistic used to test for differences
- Parametric test (normally distributed data or relying on the central limit theorem)
Types
- Single: data are collected from a single sample and compared to a pre-specified fixed value, which can be known or hypothetical
- Independent: comparing means between two independent samples (independent variable is binary and dependent variable is numeric)
- Paired: the same participants are used in each of the experimental conditions. Paired designs are used to make inferences about population-level differences attributable to a specific intervention, experimental condition, or change over time.
Assumptions:
- The data are Normally distributed
- The variances in the two populations are roughly equal
- There are no outliers
- The dependent variable is numerical
- Observations are independent.
- Errors or deviations from the expected values are due to random chance rather than systematic biases or confounding factors.
- For paired tests: the size of the paired values should not be associated with the size of the difference, and all assumptions relate to the differences between each matched pair. Bias can be introduced through order effects
Non-parametric alternative:
- Independent: Mann-Whitney U test
- Single or paired: Wilcoxon Signed Rank test
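The independent-samples t-statistic can be computed by hand from the two sample means and a pooled variance estimate (the equal-variance form). A minimal sketch in pure Python, with invented sample values:

```python
import math

def independent_t(sample_a, sample_b):
    """Two-sample t-statistic with pooled variance; returns (t, df).

    t = (mean_a - mean_b) / sqrt(pooled_var * (1/na + 1/nb)),
    df = na + nb - 2.
    """
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances (divide by n - 1)
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (mean_a - mean_b) / se, na + nb - 2

t, df = independent_t([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
print(t, df)  # t = 2.0 with 8 degrees of freedom
```

The t-statistic is then compared against the t-distribution with the returned degrees of freedom to obtain a p-value.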
When is a z-test used? What are the assumptions?
Comparison of two means or proportions (binary exposure and outcome variables)
Parametric test therefore assumptions are:
- Normal distribution (or large samples, commonly n >= 30, so that the central limit theorem applies)
- Binary variables
- Independence of samples
Can be independent, single or paired
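For the proportions case, a two-proportion z-test can be sketched by hand using the pooled proportion for the standard error. The counts below are illustrative, not from the text:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z = (p1 - p2) / sqrt(p*(1-p)*(1/n1 + 1/n2)), p = pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# e.g. 30/100 exposed vs 20/100 unexposed experiencing the outcome
z = two_proportion_z(30, 100, 20, 100)
print(round(z, 3))  # 1.633
```

The resulting z is compared against the standard Normal distribution (|z| > 1.96 for significance at the 5% level).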
When is a Chi-squared test used? What are the assumptions?
Non-parametric test used for comparing counts of categorical responses between two or more independent groups
Data should be placed in an r x c contingency table, where r is the number of exposure groups (rows) and c is the number of possible outcomes (columns)
Assumptions:
- Independent subjects
- Large enough expected values: no more than 20% of expected counts should be < 5, and none should be < 1.
Chi squared test for trend can be calculated when exposure category is ordinal and the other is binary
How do you calculate a chi-squared test for a 2x2 table?
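One way to sketch the 2x2 calculation: compute each cell's expected count as (row total x column total) / grand total, then sum (O - E)^2 / E over the four cells (1 degree of freedom). The counts below are invented for illustration:

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 table laid out as:

                outcome+  outcome-
    exposed        a         b
    unexposed      c         d
    """
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count per cell = row total * column total / grand total
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi_squared_2x2(20, 30, 10, 40), 2))  # 4.76
```

For a 2x2 table this matches the shortcut formula n(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)); the statistic is then compared against the chi-squared distribution with 1 df (critical value 3.84 at the 5% level).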
What is the 95% CI
(DEAD)
D - 95% chance that the true population parameter lies within 1.96 standard errors above or below any sample statistic. Therefore 95% CI is the range within which we can be 95% certain that the true effect lies.
E - If the test were repeated 100 times with random samples of the population, the true population parameter would lie within this range 95 of those times. 95% CI = sample statistic +/- (1.96 x SE)
A - Allows reader to assess how precise sample estimate is and assess statistical significance in absence of p value.
D - Only valid if the experiment is valid and unbiased.
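The formula above (sample statistic +/- 1.96 x SE) can be sketched for a sample mean, where SE = SD / sqrt(n). The sample values are invented for illustration:

```python
import math

def ci_95(sample):
    """95% CI for a sample mean: mean +/- 1.96 * (sd / sqrt(n))."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    se = sd / math.sqrt(n)
    return mean - 1.96 * se, mean + 1.96 * se

lower, upper = ci_95([8, 10, 12, 10, 10])
print(lower, upper)
```

Note the interval narrows as n grows (SE shrinks with sqrt(n)), which is why larger samples give more precise estimates.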
What is the normal distribution?
What is a binomial distribution?
What is the poisson distribution?
Aside from normal, binomial and poisson distributions, name 3 other distributions and give one use of each?
What is statistical inference and what are the two main methods?
What is sampling error?
What is the central limit theorem?
What is conditional probability? What does it allow epidemiologists to do?
The probability that event B will occur given that event A has already occurred
P(B|A) = P(A and B)/ P(A)
Allows epidemiologists to evaluate how different treatments or exposures influence the probability that outcomes, such as disease or mortality, occur. Also a good way to evaluate diagnostic tests.
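The definition P(B|A) = P(A and B) / P(A) can be illustrated with counts. The numbers below are hypothetical: 1000 patients, 200 exposed (A), 50 both exposed and diseased (A and B):

```python
# Hypothetical cohort counts (not from the text)
n_total = 1000
n_a = 200        # exposed (event A)
n_a_and_b = 50   # exposed AND diseased (A and B)

p_a = n_a / n_total
p_a_and_b = n_a_and_b / n_total

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)  # probability of disease given exposure = 0.25
```

Equivalently, this is just 50/200: restricting the denominator to those in whom A occurred.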
What is the difference between a type 1 and type 2 error?
Errors occur if the finding in the study sample does not reflect the true finding in the population.
- Type 1 error (alpha): rejecting the null hypothesis when it is in fact true (false positive).
- Type 2 error (beta): failing to reject the null hypothesis when it is in fact false (false negative).
A power calculation is done before the study to determine the appropriate sample size.
A type 2 error is considered less serious because it is only an error in that an opportunity to reject the null hypothesis was missed. Usually due to the sample size being too small.
- Probability of a type 2 error is called beta -> Power = 1 - beta
How do you calculate an odds ratio?
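From a 2x2 table, the odds ratio is the odds of exposure in cases divided by the odds in controls, which simplifies to the cross-product (a x d) / (b x c). A sketch with invented counts:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as:

                cases  controls
    exposed       a       b
    unexposed     c       d

    OR = (a/c) / (b/d) = (a * d) / (b * c)
    """
    return (a * d) / (b * c)

# e.g. 30/70 exposed among cases vs 10/90 among controls
print(odds_ratio(30, 70, 10, 90))  # ~3.86
```

An OR of 1 indicates no association; the cross-product form is why any cell of 0 makes the OR undefined or zero.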
What is probability and how is it expressed?
A measure of the likelihood that an event will occur.
Expressed as a positive number between 0 (event will never occur) and 1 (event is certain to occur)
What are the rules determining how two or more probabilities can be combined?
Addition rule
- Used to find the probability P that at least one event will occur out of 2 or more possible events
- Non mutually-exclusive events:
P(A or B) = P(A) + P(B) - P(A and B) (equivalently, 1 - P(neither A nor B))
- Mutually exclusive events:
P(A or B) = P(A) + P(B)
Multiplication rule
- The probability of a joint occurrence of 2 or more events.
- P(A and B) = P(A) x P(B|A)
- For independent events, event B is not affected by event A, therefore P(B|A) = P(B) and therefore the equation can be simplified to:
P(A and B) = P(A) x P(B)
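Both rules can be checked with a single fair six-sided die (an assumed example, not from the text):

```python
# Events on one roll of a fair die
p_even = 3 / 6   # A: roll is even {2, 4, 6}
p_gt4 = 2 / 6    # B: roll is > 4 {5, 6}
p_both = 1 / 6   # A and B: {6}

# Addition rule (non mutually-exclusive):
# P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_even + p_gt4 - p_both
print(p_either)  # 4/6, matching the outcomes {2, 4, 5, 6}

# Multiplication rule for independent events (two separate rolls):
# P(six on roll 1 AND six on roll 2) = P(six) * P(six)
p_two_sixes = (1 / 6) * (1 / 6)
print(p_two_sixes)  # 1/36
```

Subtracting P(A and B) in the addition rule avoids double-counting the outcome {6}, which belongs to both events.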
What 4 key factors do you have to consider in a sample size calculation?
Give some situations in which a sample size could be increased or decreased.
Increased:
- High loss to follow-up (LTFU)
- Low response rate
- Cluster sampling
- Confounding
- Interaction
Decreased:
- Matched case controls
Increasing sample size can only reduce the likelihood of errors caused by chance; it cannot compensate for bias.
What is the problem of multiple comparisons and how do you account for this?
What are the two broad types of measures in descriptive statistics that describe numerical continuous data and give examples of each
Measures of location
- Mean
- Median
- Mode
Measures of spread
- Standard deviation
- Interquartile range
- Skewness (strictly a measure of the shape/asymmetry of the distribution)
- Variance