Statistics Flashcards

(55 cards)

1
Q

What is survival analysis? What are the key statistics?

A

Survival analysis is analysis in which the outcome variable is time until the occurrence of an event of interest

Survival function: The probability of NOT experiencing the event of interest at least up to time t (i.e. surviving beyond t). Illustrated on a survival curve

Hazard function: Conditional probability of experiencing the event of interest at time t, having survived to that time (i.e. the instantaneous event rate)

2
Q

Describe the key ways survival analysis is done. What are the key assumptions?

A

Kaplan-Meier
- Calculates the survival function each time an event occurs and shows survival visually on a curve (see the sketch below). There is a step at each occurrence of an event or censoring
- Non-parametric
- A statistically significant difference between two groups can be assessed using the Log Rank test, but this cannot tell us the magnitude of the difference and cannot account for covariates

Cox proportional hazards regression
- Proportional hazards assumption: the relative hazard remains constant over the time period (survival curves don't cross over)
- Results give you a log hazard ratio, which can be exponentiated to the HR

Other key assumptions:
- Censoring must be non-informative (the probability of censoring must be unrelated to the probability of having the event)
- Effects of predictor variables on survival must be constant over time and multiplicatively related to the hazard (e.g. the relative hazard of men vs. women must be the same for young vs. old)
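A minimal sketch of the Kaplan-Meier product-limit calculation in Python (the data and function name are illustrative, not from any particular study):

```python
# Hand-rolled Kaplan-Meier product-limit estimator.
# `times` are follow-up times; `observed` is True for an event,
# False for a censored observation.
def kaplan_meier(times, observed):
    n_at_risk = len(times)
    surv = 1.0
    curve = []  # (time, survival probability) steps
    for t in sorted(set(times)):
        events = sum(1 for ti, ev in zip(times, observed) if ti == t and ev)
        leaving = sum(1 for ti in times if ti == t)
        if events:  # the curve only steps down at events, not censorings
            surv *= (n_at_risk - events) / n_at_risk
        curve.append((t, surv))
        n_at_risk -= leaving
    return curve

# 5 subjects; subjects 3 and 5 are censored
print(kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False]))
```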

3
Q

What is a censored observation and how do you deal with censoring issues?

A

A censored observation is one where the event in question (such as death or discharge from hospital) has not happened at the time of the analysis, and all we know is the length of time the subject has been in the study
- Right censoring: withdrawing from study, lost to follow up, end of study without experiencing outcome.
- Left censoring: event of interest occurs before the start of observation period.
- Interval censoring: Know the interval that the event of interest occurred in but not exact date

Dealing with censoring issues
- Can impute data
- Could remove censored data
- Sensitivity analysis with best and worst case scenarios for missing data

4
Q

T-tests
- When are they used and what are the different types?
- What are the assumptions of a t-test?
- What is the non-parametric alternative?

A

- Used for comparing two means
- The t-statistic is used to test for differences
- Parametric test (normally distributed data, or relying on the central limit theorem)

Types
- Single: data is collected from a single sample and compared to a pre-specified fixed value. This can be a known or hypothetical value
- Independent: comparing means between two independent samples (independent variable is binary and dependent variable is numeric)
- Paired: The same participants are used in each of the experimental conditions. Paired samples designs are used when researchers want to make inferences about population-level differences attributable to a specific intervention, experimental condition, or over time.

Assumptions:
- The data are Normally distributed
- The variances in both sample populations are roughly equal
- There are no outliers
- The dependent variable is numerical
- Observations are independent.
- Errors or deviations from the expected values are due to random chance rather than systematic biases or confounding factors.
- For paired tests: the size of the pair values is not associated with the size of the difference, and all assumptions relate to the differences between each matched pair. Bias can be introduced through order effects

Non-parametric alternative:
- Independent: Mann-Whitney U test
- Single or paired: Wilcoxon Signed Rank test
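A short sketch of the variants above using SciPy (assumes scipy is installed; the data are made up):

```python
from scipy import stats

group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 5.0, 4.4, 4.6]

# Single (one-sample): compare a sample mean to a fixed value
print(stats.ttest_1samp(group_a, popmean=5.0))

# Independent: two separate groups
print(stats.ttest_ind(group_a, group_b))

# Paired: the same subjects measured twice
print(stats.ttest_rel(group_a, group_b))

# Non-parametric alternatives
print(stats.mannwhitneyu(group_a, group_b))  # independent
print(stats.wilcoxon(group_a, group_b))      # paired
```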

5
Q

When is a z-test used? What are the assumptions?

A

Comparison of two means or two proportions (for proportions, the exposure and outcome variables are binary)

Parametric test, therefore the assumptions are:
- Normal distribution (samples must be n > 15 for this to hold)
- Binary variables (when comparing proportions)
- Independence of samples

Can be independent, single or paired
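A minimal sketch of a two-proportion z-test computed by hand (the counts are illustrative; uses the pooled proportion under the null):

```python
from math import sqrt

x1, n1 = 45, 200   # events / sample size, group 1
x2, n2 = 30, 210   # events / sample size, group 2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(f"z = {z:.2f}")  # |z| > 1.96 corresponds to p < 0.05 (two-sided)
```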

6
Q

When is a Chi-squared test used? What are the assumptions?

A

Non-parametric test used for comparing counts of categorical responses between two or more independent groups

Data should be placed in an r x c contingency table, where r is the number of exposure groups (rows) and c is the number of possible outcomes (columns)

Assumptions:
- Independent subjects
- Large enough expected values: no more than 20% of expected cell counts should be below 5, and none below 1

A chi-squared test for trend can be calculated when the exposure variable is ordinal and the outcome is binary

7
Q

How do you calculate a chi-squared test for a 2x2 table?

A
  • Draw the r x c table
  • Calculate row and column totals
  • Calculate the expected number for each cell (row total x column total / grand total)
  • Calculate (O-E)^2/E for each cell
  • X^2 is the sum of (O-E)^2/E across all cells
  • X^2 > 3.84 corresponds to p < 0.05 (for 1 degree of freedom)
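A worked sketch of the steps above in Python (the counts are made up):

```python
# rows = exposure groups, columns = outcomes
observed = [[20, 30],   # exposed:   outcome yes / no
            [10, 40]]   # unexposed: outcome yes / no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(f"X^2 = {chi2:.2f}")  # 4.76 > 3.84, so p < 0.05 on 1 df
```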
8
Q

What is the 95% CI
(DEAD)

A

D - 95% chance that the true population parameter lies within 1.96 standard errors above or below any sample statistic. Therefore 95% CI is the range within which we can be 95% certain that the true effect lies.

E - If the test was repeated 100 times with random samples of the population, 95 of those times the true population parameter would lie within this range. 95% CI = sample statistic +/- (1.96 x SE)

A - Allows reader to assess how precise sample estimate is and assess statistical significance in absence of p value.

D - Only valid if the experiment is valid and unbiased.
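A minimal sketch of the calculation (toy data; for small samples a t multiplier would replace 1.96):

```python
from math import sqrt

data = [12.1, 11.8, 13.0, 12.4, 11.5, 12.9, 12.2, 12.6]
n = len(data)
mean = sum(data) / n
sd = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / sqrt(n)  # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```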

9
Q

What is the normal distribution?

A
  • Bell shaped symmetrical curve described by the mean and variance. It describes the sampling distribution of a mean (continuous outcome variable).
  • Mean = median in perfect normal distribution
  • Other distributions approximate to the normal distribution when the sample size is sufficiently large
  • The standard normal distribution has a mean of 0 and variance of 1
10
Q

What is a binomial distribution?

A
  • Shows the frequency of events that have two possible outcomes (binary) - proportion or probability.
  • Has no negative values
  • As sample size increases, it approximates to normal distribution.
  • Described by sample size and true probability.
11
Q

What is the poisson distribution?

A
  • Shows the frequency of events over time, where the events occur independently of each other (count data, e.g. deaths from MI)
  • Used to analyse rates
  • No negative values
  • As sample size increases, it approximates to the normal distribution.
  • Assumes data is discrete, random and mean = variance
12
Q

Aside from normal, binomial and poisson distributions, name 3 other distributions and give one use of each?

A
  • t-distribution: estimating the mean of a normally distributed population when the sample size is small.
  • Chi-squared: Analysing categorical data
  • f-distribution: comparing two variances and more than two means using ANOVA
13
Q

What is statistical inference and what are the two main methods?

A
  • Process of making conclusions about a population based on observations in a sample. This is only valid if the sample correctly represents the population (unbiased random sample)
  • Estimations: point estimate (e.g mean, median, difference in means, proportion) or interval estimate (e.g confidence intervals)
  • Hypothesis testing: Assesses the likelihood that the finding observed in the sample is a true finding within the population and not due to chance.
14
Q

What is sampling error?

A
  • Observations in a sample are unlikely to be exactly the same as those in the true population due to sampling variability, therefore they are subject to a degree of uncertainty. This is the sampling error.
  • Measured by the standard error: estimates how precisely a population parameter is estimated by the equivalent statistic in a sample. It is the standard deviation of the sampling distribution of the statistic.
  • Standard error is the basis of calculating confidence intervals and hypothesis testing.
15
Q

What is the central limit theorum?

A

The central limit theorem states that, with a sufficiently large sample size, the sampling distribution of the mean approximates a normal distribution regardless of the shape of the underlying population distribution. This is what allows parametric tests to be used on non-normally distributed data when samples are large.
16
Q

What is conditional probability? What does it allow epidemiologists to do?

A

The probability that event B will occur given that event A has already occurred

P(B|A) = P(A and B)/ P(A)

Allows epidemiologists to evaluate how different treatments or exposures influence the probability that outcomes, such as disease or mortality, occur. Also a good way to evaluate diagnostic tests.
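A tiny worked example (the counts are invented):

```python
# Of 200 patients, 80 are smokers (A); 24 are smokers with disease (A and B)
p_a = 80 / 200
p_a_and_b = 24 / 200

p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)
print(p_b_given_a)              # 0.3: disease risk among smokers
```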

17
Q

What is the difference between a type 1 and type 2 error?

A

Errors occur if the finding in the study sample does not reflect the true finding in the population.

- Type 1 error (alpha): rejecting the null hypothesis when it is in fact true (false positive). The significance level, usually 0.05, is the accepted probability of a type 1 error
- Type 2 error (beta): failing to reject the null hypothesis when it is in fact false (false negative). Power = 1 - beta

A power calculation is done before the study to determine the appropriate sample size.

A type 2 error is considered less serious because it is only an error in that an opportunity to reject the null hypothesis was missed. It is usually due to the sample size being too small.

18
Q

How do you calculate an odds ratio?

A

Odds ratio = odds of the outcome in the exposed / odds of the outcome in the unexposed. For a 2x2 table with cells a (exposed, outcome), b (exposed, no outcome), c (unexposed, outcome) and d (unexposed, no outcome): OR = (a x d) / (b x c)
19
Q

What is probability and how is it expressed?

A

A measure of the likelihood that an event will occur.

Expressed as a positive number between 0 (event will never occur) and 1 (event is certain to occur)

20
Q

What are the rules determining how two or more probabilities can be combined?

A

Addition rule
- Used to find the probability that at least one event will occur out of 2 or more possible events
- Non mutually-exclusive events:
P(A or B) = P(A) + P(B) - P(A and B), equivalently 1 - P(neither A nor B)
- Mutually exclusive events:
P(A or B) = P(A) + P(B)

Multiplication rule
- The probability of a joint occurrence of 2 or more events.
- P(A and B) = P(A) x P(B|A)
- For independent events, event B is not affected by event A, therefore P(B|A) = P(B) and therefore the equation can be simplified to:
P(A and B) = P(A) x P(B)
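A small numeric check of the two rules, assuming two independent events:

```python
p_a, p_b = 0.3, 0.2

# Multiplication rule (independent): P(A and B) = P(A) x P(B)
p_a_and_b = p_a * p_b

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B) = 1 - P(neither)
p_a_or_b = p_a + p_b - p_a_and_b
p_neither = (1 - p_a) * (1 - p_b)
assert abs(p_a_or_b - (1 - p_neither)) < 1e-12
print(round(p_a_and_b, 2), round(p_a_or_b, 2))  # 0.06 0.44
```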

21
Q

What 4 key factors do you have to consider in a sample size calculation?

A
  • Size of difference: the effect size needed to be clinically or meaningfully significant (a smaller effect size needs a larger sample)
  • Significance level: the p-value threshold (i.e. the type 1 error rate), usually set at 0.05
  • Power: the probability that the study will be able to detect a difference that truly exists. Usually a power of 80% or more is used; higher power needs a larger sample size
  • Exposure in baseline population: a smaller prevalence requires a larger sample. In case-control studies, this is the prevalence of exposure in the controls. In cohort/intervention studies, this is the prevalence of the outcome in the unexposed population
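Not on this card, but as an illustration: a commonly used normal-approximation sample size formula for comparing two proportions, bringing the four factors together (treat the formula choice and numbers as an assumed example):

```python
from math import ceil

z_alpha = 1.96   # two-sided 5% significance level
z_beta = 0.84    # 80% power
p1, p2 = 0.20, 0.30   # assumed outcome proportions in the two groups

n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
print(ceil(n_per_group))  # ~291 participants needed in EACH group
```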
22
Q

Give some situations in which a sample size could be increased or decreased.

A

Increased:
- High LTFU
- Low response rate
- Cluster sampling
- Confounding
- Interaction

Decreased:
- Matched case controls

Increasing the sample size can only reduce the likelihood of errors caused by chance; it cannot compensate for bias.

23
Q

What is the problem of multiple comparisons and how do you account for this?

A
  • If we repeat hypothesis testing for several variables, the likelihood of a type 1 error increases (e.g for a significance level of 0.05, if you test 20 variables separately, you are likely to find that one is significant just by chance).
  • Multivariable regression can give you a result whilst controlling for all other variables; however, it cannot simply be repeated for each variable, as this reintroduces the problem of multiple comparisons.
  • Another method is Bonferroni correction:
    If n is the number of tests the author wishes to perform (e.g salt intake and 15 different types of cancer), and original significance level is 0.05, then new significance level would be:
    0.05/n
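A tiny sketch of the correction in practice (the p-values are made up):

```python
p_values = [0.001, 0.04, 0.03, 0.20]
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # 0.05 / n tests

for p in p_values:
    print(p, "significant" if p < corrected_alpha else "not significant")
# Only p = 0.001 survives the corrected threshold of 0.0125
```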
24
Q

What are the two broad types of measures in descriptive statistics that describe numerical continuous data? Give examples of each.

A

Measures of location
- Mean
- Median
- Mode

Measures of spread
- Standard deviation
- Interquartile range
- Skew
- Variance

25
Q

What is the mean? (DEAD)

A

D - The arithmetic mean is the average value based on the sum of a set of numbers (the geometric mean is based on the product)

E - mean = (sum of values)/n

A - Good for statistical analysis

D - Sensitive to outliers and poor for asymmetrical distributions
26
Q

What is the median? (DEAD)

A

D - The value at the centre of the distribution

E - Half of observations lie above and half below

A - More appropriate for skewed distributions as it is not as sensitive to outliers as the mean

D - Value determined solely by rank, so it provides no information about any other values
27
Q

What is the mode? (DEAD)

A

D - The most commonly occurring value

E - The value occurring with the highest frequency

A - Not greatly affected by outliers and can provide additional information (e.g. suicide is bimodal, affecting young and old adults)

D - Not amenable to statistical analysis. Sometimes there is no mode, or if there is more than one mode this can be difficult to interpret
28
Q

What are percentiles? (DEAD)

A

D - Values are ranked and divided into 100 groups

E - The 50th percentile is the median, the 25th is the lower quartile and the 75th is the upper quartile

A - Useful for comparing measurements (e.g. BMI in groups of similar age and sex)

D - Comparisons made at the extreme ends of the distribution are often less informative than those at the centre
29
Q

Range (DUAD, U = units)

A

D - The difference between the maximum and minimum sample values

U - Same as the data

A - Intuitive and only needs two observations

D - Sensitive to the size of the sample and to outliers
30
Q

Interquartile range (DUAD)

A

D - The middle 50% of the sample (the difference between the upper and lower quartiles)

U - Same as the data

A - More stable than the range as the sample size increases

D - Unstable for small samples and does not allow for further mathematical manipulation
31
Q

Variance (DUAD)

A

D - The average squared deviation of each number from its mean

U - Squared units of the data

A - Takes into account all values; useful for making inferences about the population

D - Units differ from the units of the data
32
Q

Standard deviation (DUAD)

A

D - The square root of the variance

U - Same as the data

A - Units are the same as in the data; useful for making inferences

D - Sensitive to some extent to extreme values and will vary depending on the units of observation
33
Q

Coefficient of variation (DUAD)

A

D - The ratio of the standard deviation to the mean, giving an idea of the size of the variation relative to the size of the observation

U - No units

A - Allows comparison of the variation of populations that have significantly different mean values

D - Where the mean value is near zero, the coefficient of variation is highly sensitive to changes in the standard deviation
34
Q

What could the causes of the statistical result be? (checklist)

A

- True association
- Chance
- Bias
- Confounding
- Reverse causation
35
Q

What is skew, and what measures describe data when skew is and is not present?

A

- Skew measures asymmetry in the distribution of data. It can be positive or negative
- Skew present: describe the data with median and IQR
- Skew not present: mean and standard deviation
36
Q

What is correlation? What is the key statistical test for correlation and how is the result interpreted? What are the assumptions for this test?

A

- Correlation is a statistical technique that measures the strength of an association. It does not tell you anything about the magnitude of the association
- Pearson's correlation coefficient is the key test. It gives you an r value ranging from -1 to +1: -1 is a very strong negative correlation, +1 is a very strong positive correlation, and 0 is no correlation
- r^2 is often given too: it is the proportion of the variance in y that is explained by x

Pearson's correlation assumptions:
- Both variables should be continuous numerical variables
- Both variables should be normally distributed
- Assumption/plausibility of a linear relationship
- Homoscedasticity: the variance of the data around the line of best fit remains constant along the line
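A short sketch with SciPy (illustrative data):

```python
from scipy import stats

x = [10, 20, 30, 40, 50, 60]
y = [12, 19, 33, 38, 52, 57]

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}, r^2 = {r**2:.3f}")
```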
37
Q

What does linear regression tell us and when can it be used? What is the equation for simple linear regression? What are the assumptions?

A

- Commonly used in study settings where there is an assumed dose-response relationship. Linear regression tells us the magnitude of the response for any unit increase in dose
- The independent variable can be categorical or numerical, but the dependent (outcome) variable must be continuous

Equation: Y = a + bx + error
- a = intercept
- b = regression coefficient (slope)

Assumptions:
- The distribution of y for each value of x is normal
- The standard deviation of this normal distribution of y is the same for each x (homoscedasticity: errors have equal variance along the line of best fit)
- Mean values of the y distribution are linearly related to x
- An additional assumption for multiple linear regression is that there is little or no correlation between the independent variables
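A minimal sketch of fitting Y = a + bx with SciPy (assumed dose-response data):

```python
from scipy import stats

dose = [10, 20, 30, 40, 50, 60]
response = [12, 19, 33, 38, 52, 57]

fit = stats.linregress(dose, response)
print(f"intercept a = {fit.intercept:.2f}")
print(f"slope b = {fit.slope:.2f} (response per unit increase in dose)")
print(f"p = {fit.pvalue:.4f}, r^2 = {fit.rvalue ** 2:.3f}")
```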
38
Q

What are sensitivity and specificity and how are they calculated? What about positive and negative predictive values?

A

'Sense the guilty, specify the innocent'
- Sensitivity: ability of a test to detect people with true disease (%)
- Specificity: ability of a test to exclude individuals without the disease (%)
- PPV: proportion of people with a positive test who have the disease (%)
- NPV: proportion of people with a negative test who don't have the disease (%)

The values for false positives and false negatives are found using a gold standard test (the sensitivity and specificity of a gold standard test are therefore 100%). Screening tests are generally not gold standard, but are used to get the right people to a gold standard test.

When diseases are rare, PPV is not particularly good. PPV is far more dependent on specificity than sensitivity when the disease is rare. As screening usually tests for rare diseases, we usually prefer high specificity over high sensitivity to ensure a high enough PPV and minimise unnecessary gold standard testing in the test-positive group.
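A worked sketch from an assumed 2x2 table against a gold standard (1,000 people, 100 of whom have the disease):

```python
tp, fn = 90, 10     # diseased: test positive / test negative
fp, tn = 190, 710   # healthy:  test positive / test negative

sensitivity = tp / (tp + fn)   # detect people WITH the disease
specificity = tn / (tn + fp)   # exclude people WITHOUT the disease
ppv = tp / (tp + fp)           # positives who truly have the disease
npv = tn / (tn + fn)           # negatives who are truly disease free

print(round(sensitivity, 2), round(specificity, 2),
      round(ppv, 2), round(npv, 2))
# PPV is only 0.32 despite 90% sensitivity, because the disease is rare
```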
39
Q

What is the difference between odds and risk?

A

The difference is in the denominator: odds are a simple ratio of outcome vs. no outcome (A/B), whereas risk is the proportion of the total with the outcome (A/(A+B)).

Case-control studies can only use odds ratios, as risk would simply be a function of the number of cases vs. controls in the study. Cohort studies and clinical trials can use either OR or RR.
40
Q

How do you calculate an odds ratio?

A

Odds of the outcome in the exposed / odds of the outcome in the unexposed
41
Q

How do you calculate a risk ratio? (synonymous with relative risk)

A

Risk of the outcome in the exposed / risk of the outcome in the unexposed
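A small sketch contrasting the two ratios, from an assumed 2x2 table:

```python
a, b = 30, 70    # exposed:   outcome / no outcome
c, d = 15, 85    # unexposed: outcome / no outcome

risk_ratio = (a / (a + b)) / (c / (c + d))
odds_ratio = (a / b) / (c / d)   # equivalently (a*d) / (b*c)

print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}")
# RR = 2.00, OR = 2.43: the OR overstates the RR when the outcome is common
```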
42
Q

When do odds approximate to risks (probabilities)?

A

When the outcome is rare
43
Q

What is number needed to treat and how do you calculate it?

A

The number of whole patients (always round the answer UP) that must be treated with a particular intervention to prevent one outcome of interest.

NNT = 1/ARR

Where ARR = Absolute Risk Reduction (the absolute risk in the intervention group subtracted from the absolute risk in the control group).

Number Needed to Harm follows the same principle but involves calculating how many must be "treated" to cause one negative outcome.
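A minimal worked example (the risks are invented):

```python
from math import ceil

risk_control = 0.20       # 20% of controls have the outcome
risk_intervention = 0.15  # 15% of treated patients do

arr = risk_control - risk_intervention   # absolute risk reduction
nnt = ceil(1 / arr)                      # always round UP
print(nnt)  # 20: treat 20 patients to prevent one outcome
```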
44
Q

How do you calculate:
- Point prevalence
- Period prevalence
- Cumulative incidence
- Incidence rate

A

- Point prevalence: number of cases at a point in time / at-risk population x 100
- Period prevalence: total cases over a defined period (including initial cases at the start point and any new cases by the end point) / population studied x 100
- Cumulative incidence (incidence risk): number of new cases in a specified time / number of disease-free people at the beginning of the time period x 100. This is the same as risk
- Incidence rate: number of new cases in a given time / total person-time at risk during the study period. Usually x 100 and expressed as per 100 person-years
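Worked sketches of the four measures, with assumed numbers for a population of 1,000 followed for one year:

```python
population = 1000
cases_at_start = 50
new_cases = 25
person_years = 920.0   # follow-up contributed while disease free

point_prevalence = cases_at_start / population * 100
period_prevalence = (cases_at_start + new_cases) / population * 100
cumulative_incidence = new_cases / (population - cases_at_start) * 100
incidence_rate = new_cases / person_years * 100

print(point_prevalence, period_prevalence, round(cumulative_incidence, 1))
print(f"{incidence_rate:.1f} per 100 person-years")
```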
45
Q

What is the relationship between prevalence and incidence?

A

Prevalence depends on the incidence rate and the duration of disease. If the incidence of a disease is low but the duration is long, prevalence will be high relative to incidence (e.g. TB). But if incidence is high and duration is short, prevalence will be low relative to incidence.

A change in the duration of a disease, for example the development of a new treatment which prevents death but does not result in a cure, will lead to an increase in prevalence.

A population in which the numbers of people with and without the disease remain stable is known as a steady-state population. In such (theoretical) circumstances, the point prevalence of disease is approximately equal to the product of the incidence rate and the mean duration of disease (i.e. the length of time from diagnosis to recovery or death), provided that prevalence is less than about 0.1. That is:

Prevalence = Incidence x Duration
46
Q

What are the attributable risk, attributable fraction and population attributable risk?

A

- Attributable risk: the amount of risk in the exposed group that is due to the exposure.
Attributable risk = risk in exposed - risk in unexposed
- Attributable fraction: the proportion of disease in the exposed that can be considered attributable to the exposure, after allowing for the risk of disease that would have occurred anyway.
Attributable fraction = attributable risk / risk in exposed
- Population attributable risk: the excess rate of disease in the whole study population (exposed and unexposed) that is attributable to the exposure.
PAR = risk in population - risk in unexposed
- Population attributable risk fraction: the proportion of disease in the study population that is attributable to the exposure (and thus the proportion of disease that could be eliminated if the exposure were eliminated).
PARF = PAR / overall risk in population
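A sketch of all four measures with assumed risks:

```python
risk_exposed = 0.30
risk_unexposed = 0.10
risk_population = 0.18   # overall risk, exposed and unexposed combined

ar = risk_exposed - risk_unexposed        # attributable risk
af = ar / risk_exposed                    # attributable fraction
par = risk_population - risk_unexposed    # population attributable risk
parf = par / risk_population              # PAR fraction

print(round(ar, 2), round(af, 2), round(par, 2), round(parf, 2))
# 0.2 0.67 0.08 0.44
```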
47
Q

What is heterogeneity and what are the 3 main types?

A

Heterogeneity is differences between observations, populations or studies. It is a particular issue for systematic reviews/meta-analyses and may preclude the pooling of results from different studies.

3 main types:
- Statistical: differences in reported effects between studies. May be explained by methodological or clinical heterogeneity
- Clinical: differences relating to patient characteristics, interventions or outcome measures
- Methodological: due to differences in study design
48
Q

What are the two commonly used ways to test for statistical heterogeneity?

A

- Cochran's Q statistic: the null hypothesis is that the studies are the same (therefore a low p-value = high heterogeneity). The test has low power when there are few studies, so a significance level of 10% is often used
- I^2 statistic: provides a measure of the degree of heterogeneity across studies by describing the percentage of total variation across studies that is due to heterogeneity rather than chance. Rough rule of thumb: 25% = low, 50% = moderate, 75% = high
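A minimal sketch of I^2 derived from Cochran's Q (a standard identity; the Q and k values here are invented):

```python
def i_squared(q, k):
    df = k - 1
    return max(0.0, (q - df) / q) * 100  # % of variation beyond chance

print(i_squared(q=24.0, k=10))  # 62.5: moderate-to-high heterogeneity
```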
49
Q

If heterogeneity is found in a meta-analysis, what are two techniques that can be used?

A

- Random-effects meta-analysis
- Meta-regression with weighted regression estimates (if another variable is found to explain the observed heterogeneity)
50
Q

What is publication bias and how can you detect it in a meta-analysis?

A

- Publication bias occurs when the result of a study influences the likelihood of publication of that study. It happens because researchers are less likely to submit, and journals less likely to publish, studies with inconclusive or negative results. It is especially a problem for smaller studies
- A funnel plot can be used to detect it
51
Q

What are the key features of a funnel plot and what does it likely mean if it is asymmetrical at the bottom?

A

- Estimated treatment effect values (e.g. log(odds ratio), x axis) are plotted against a measure of their precision (e.g. standard error or sample size, y axis). As the sample size increases, precision increases, giving the plot its funnel shape
- Asymmetry at the bottom can be due to the small-study effect (the phenomenon of small trials tending to report larger treatment benefits than larger trials). If this exists, the summary effect measure of the meta-analysis will be overestimated. This can be caused by publication bias, but it can also arise in other ways: for example, studies targeting high-risk patients tend to be smaller, but the intervention effect can be larger
52
Q

Give examples of how to reduce publication bias

A

- Pre-registration (e.g. a register of controlled trials)
- Active discouragement of studies that do not have sufficient power to detect effects
- Publication of study protocols
- Pre-prints
53
Q

What is the role of Bayes' Theorem? What is an advantage and a disadvantage?

A

- Incorporates prior beliefs into calculations of probability (the posterior). It governs the conditional probability calculations used for dependent events
- Diagnostic test in a Bayesian framework: posterior odds of disease = prior odds x likelihood ratio of a positive test result
- Advantage: makes use of all available knowledge, therefore possibly more ethical
- Disadvantage: different users will obtain different conclusions if they choose different priors
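A sketch of the Bayesian diagnostic update quoted above (the prior probability and likelihood ratio are assumed values):

```python
prior_prob = 0.10                  # pre-test probability of disease
prior_odds = prior_prob / (1 - prior_prob)

lr_positive = 8.0                  # sensitivity / (1 - specificity)
posterior_odds = prior_odds * lr_positive
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"post-test probability = {posterior_prob:.2f}")  # ~0.47
```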
54
Q

What is McNemar's X^2 test?

A

- Used for paired binary data (e.g. repeated measures of the same variable in each participant, or an individually matched case-control study)
- Presented in a 2x2 table that shows agreement (case and control both exposed) or discordance (one exposed and one not)
- X^2 > 3.84 corresponds to p < 0.05
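A worked sketch using the standard form of the statistic, which depends only on the discordant pairs (the counts are invented):

```python
b = 25   # pairs where only the case was exposed
c = 10   # pairs where only the control was exposed

chi2 = (b - c) ** 2 / (b + c)
print(f"X^2 = {chi2:.2f}")  # 6.43 > 3.84, so p < 0.05
```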
55
Q

When would a chi-squared test for trend be used?

A

When the data have ordered (ordinal) categorical exposure variables