What is the arithmetic mean and its advantages and disadvantages?
Arithmetic mean = sum of all sample values/sample size
Advantage: uses all values in the data, so statistically efficient
Disadvantage: vulnerable to outliers
What is the median and its advantages and disadvantages?
Median = list all the observations in order. Median is the middle value (for an odd number of observations), or the average of the two middle observations (for an even number of observations)
Advantage: not vulnerable to outliers
Disadvantage: not statistically efficient as it does not make use of all the individual data values
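A minimal sketch of the outlier point, with invented values, using Python's statistics module:

```python
# Minimal sketch: the mean uses every value, so a single outlier shifts it;
# the median is robust to the same outlier. Values are invented.
from statistics import mean, median

values = [2, 3, 3, 4, 5]
with_outlier = values + [100]

print(mean(values), median(values))              # 3.4 3
print(mean(with_outlier), median(with_outlier))  # 19.5 3.5
```

One extreme value moves the mean from 3.4 to 19.5, while the median barely changes.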
What is the mode? When might it be used?
Most common value. Not used much in statistical analysis. Can be used for categorical data to describe the most frequent category.
What is the range and IQR? Why is IQR useful?
Range: the difference between the largest and smallest values (often reported as the interval from smallest to largest)
IQR: the interval from the 25th to the 75th percentile. Can be obtained from a list of values by first finding the median and then taking the median of each half to get the lower quartile (LQ) and upper quartile (UQ).
Advantage of IQR is that it is not vulnerable to outliers.
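A sketch of the quartile method described above (order the values, find the median, then take the median of each half), with invented data:

```python
# Sketch: LQ and UQ as the medians of the lower and upper halves of the
# ordered data; the median itself is skipped when n is odd. Data invented.
from statistics import median

def iqr(values):
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]                  # values below the median
    upper = s[half + (len(s) % 2):]   # values above the median
    return median(upper) - median(lower)

data = [1, 3, 5, 7, 9, 11, 13, 100]   # the outlier barely moves the IQR
print(iqr(data))                      # 8.0
```

The outlier 100 inflates the range but leaves the IQR at 8, illustrating the robustness claim.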
What is the standard deviation and variance?
The variance is the average squared deviation of each number from the sample mean. It is a statistical measure of spread.
The standard deviation is the square root of the variance. It is also a statistical measure of spread around the mean.
Why is the standard deviation useful? When should it not be used?
NB: the standard deviation is useful in medicine because, for data that follow a normal distribution, 95% of observations will lie within 1.96 (approximately 2) SDs of the mean; this interval is known as the reference range.
SD should not be used for skewed data. IQR should be used instead.
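A sketch with invented measurements, computing the variance, SD and approximate 95% reference range:

```python
# Sketch: sample variance, SD, and the ~95% reference range (mean ± 1.96 SD)
# for roughly normally distributed data. Values are invented.
from statistics import mean, stdev, variance

data = [4.2, 4.8, 5.1, 4.9, 5.5, 4.6, 5.0, 4.7]
m, sd = mean(data), stdev(data)

print(f"variance = {variance(data):.3f}, SD = {sd:.3f}")
print(f"reference range = ({m - 1.96 * sd:.2f}, {m + 1.96 * sd:.2f})")
```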
Define sensitivity and specificity
Sensitivity = true positives/all those who have the disease - P(T+|D+)
Specificity = true negatives/all those who do not have the disease - P(T-|D-)
NB these are NOT affected by prevalence, since they are characteristics of the test rather than of the population, although for rare diseases it can be tricky to measure sensitivity accurately.
Define false negative rate and false positive rate
False negative rate = 1 - sensitivity
False positive rate = 1 - specificity
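A sketch deriving all four quantities from a hypothetical 2x2 table (counts invented for illustration):

```python
# Sketch from a hypothetical 2x2 table (counts are invented).
tp, fn = 90, 10    # diseased patients: test positive / test negative
fp, tn = 30, 870   # disease-free patients: test positive / test negative

sensitivity = tp / (tp + fn)   # P(T+|D+) = 0.9
specificity = tn / (tn + fp)   # P(T-|D-) ≈ 0.967
fnr = 1 - sensitivity          # false negative rate
fpr = 1 - specificity          # false positive rate

print(round(sensitivity, 3), round(fnr, 3))
print(round(specificity, 3), round(fpr, 3))
```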
Outline the rules of elementary probability theory
P(A or B) when mutually exclusive = P(A) + P(B)
P(A or B) when not mutually exclusive = P(A) + P(B) – P(A and B)
P(A and B) = P(A) x P(B|A) = P(B) x P(A|B) (this uses conditional probability, e.g. the probability of having neuropathy given that someone has diabetes)
P(A and B) = P(A) x P(B) (this is for independent events e.g. probability of being blood group O and getting diabetes - assuming they are completely unrelated)
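The addition rule can be checked by enumerating a simple example not in the notes, a fair six-sided die roll, using exact fractions to avoid floating-point noise:

```python
# Sketch: verifying P(A or B) = P(A) + P(B) - P(A and B) on a fair die.
from fractions import Fraction

outcomes = range(1, 7)

def p(event):
    return Fraction(sum(1 for x in outcomes if event(x)), 6)

even = lambda x: x % 2 == 0              # {2, 4, 6}
high = lambda x: x >= 5                  # {5, 6}
both = lambda x: even(x) and high(x)     # {6} - not mutually exclusive

print(p(lambda x: even(x) or high(x)) == p(even) + p(high) - p(both))  # True
```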
What is the formula for Bayes’ Theorem?
P(A) x P(B|A) = P(B) x P(A|B)
Can be rearranged to give formula for Bayes’ theorem
P (B|A) = [ P(A|B) x P(B) ] / P(A)
Thus, the probability of B given A is the probability of A given B, times the probability of B divided by the probability of A.
This is derived from the multiplication rule for conditional probability.
Define PPV and NPV
PPV = true positives/all those who tested positive
NPV = true negatives/all those who tested negative
NB these values ARE affected by prevalence
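A sketch of the prevalence dependence, deriving PPV and NPV from assumed sensitivity, specificity and prevalence (all figures invented):

```python
# Sketch: PPV and NPV computed from assumed sensitivity, specificity and
# prevalence, showing that lowering the prevalence lowers the PPV.
def ppv_npv(sens, spec, prev):
    tp = sens * prev                # P(T+ and D+)
    fp = (1 - spec) * (1 - prev)    # P(T+ and D-)
    tn = spec * (1 - prev)          # P(T- and D-)
    fn = (1 - sens) * prev          # P(T- and D+)
    return tp / (tp + fp), tn / (tn + fn)

sens, spec = 0.90, 0.95             # assumed test characteristics
print(ppv_npv(sens, spec, 0.20))    # common disease: PPV ≈ 0.82
print(ppv_npv(sens, spec, 0.01))    # rare disease: PPV ≈ 0.15
```

The same test gives a much lower PPV in a low-prevalence population, even though sensitivity and specificity are unchanged.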
How does Bayes’ theorem relate to a diagnostic/screening test?
From the multiplication rule: P(T+ and D+) = P(T+|D+) x P(D+)
P(T+|D+) is the sensitivity of the test and P(D+) is the prevalence
Diagnostic process can be summarised by Bayes’ Theorem in this way:
P(D+|T+) = [ P(T+|D+) x P(D+) ] / P(T+)
P(D+) is the a priori probability.
P(D+|T+) is the a posteriori probability (i.e. the probability of disease updated in light of the test result)
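A numeric sketch of this, with assumed sensitivity, specificity and prevalence:

```python
# Sketch: Bayes' theorem applied to a diagnostic test, with assumed
# sensitivity, specificity and prevalence (the a priori probability).
sens, spec, prev = 0.90, 0.95, 0.02

# P(T+) by the law of total probability: true positives + false positives
p_tpos = sens * prev + (1 - spec) * (1 - prev)

# a posteriori probability P(D+|T+) = P(T+|D+) x P(D+) / P(T+)
p_d_given_tpos = sens * prev / p_tpos
print(round(p_d_given_tpos, 3))   # ≈ 0.269 despite 90% sensitivity
```

With a 2% prevalence, a positive result still leaves the probability of disease well under one third.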
How can Bayes’ theorem be summarised through odds?
Bayes theorem can be summarised by:
Odds of disease after test = odds of disease before test x likelihood ratio
Odds before test = pre-test probability / (1 - pre-test probability). For example, if the probability that an individual has coronary heart disease, before testing, is 0.70, then the odds are 0.70/(1-0.70) = 2.33 (which can also be written as 2.33:1).
Positive likelihood ratio = Sensitivity / (1 - specificity)
The usefulness of a diagnostic/screening test will depend upon the prevalence of the disease in the population to which it is applied. In general, a useful test is one which considerably modifies the pre-test probability. If the disease is very rare or very common, the pre-test and post-test probabilities will be relatively close, and so the test is of questionable value.
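A sketch of the odds form, reusing the 0.70 pre-test probability above with assumed test characteristics:

```python
# Sketch: post-test odds = pre-test odds x likelihood ratio.
pre_prob = 0.70
pre_odds = pre_prob / (1 - pre_prob)      # 0.70/0.30 ≈ 2.33

sens, spec = 0.80, 0.90                   # assumed test characteristics
lr_pos = sens / (1 - spec)                # positive likelihood ratio ≈ 8

post_odds = pre_odds * lr_pos
post_prob = post_odds / (1 + post_odds)   # convert odds back to a probability
print(round(pre_odds, 2), round(post_prob, 2))   # 2.33 0.95
```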
How can probabilities be used to determine if events are independent or not?
If the test and having the diagnosis were completely independent we would expect: P(D+ and T+ )= P(T+) x P(D+).
Therefore the difference [P(D+ and T+)] – [P(D+) x P(T+)] can be used as a crude check of independence: if there is a difference between P(D+ and T+) and P(D+) x P(T+), this suggests the events are NOT independent.
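A sketch of this crude check, with invented probabilities:

```python
# Sketch: compare the observed joint probability with the product expected
# under independence. All probabilities are invented.
p_dpos, p_tpos = 0.10, 0.12
p_joint = 0.09                       # observed P(D+ and T+)

diff = p_joint - p_dpos * p_tpos     # 0 would suggest independence
print(round(diff, 3))                # 0.078: far from 0, so NOT independent
```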
What is the difference between independent and mutually exclusive events?
Mutually exclusive events cannot happen together (P(A and B) = 0), e.g. flipping heads or tails on a single coin toss. Independent events do not affect each other's probability (P(A and B) = P(A) x P(B)), e.g. the outcomes of two separate coin flips.
What is sampling error?
The uncertainty caused by observing a sample rather than the whole population.
Generally, the larger the sample, the smaller the sampling error.
What is the standard error?
The standard error estimates how precisely a population parameter (e.g. mean, difference between means, proportion) is estimated by the equivalent statistic in a sample. (The standard error is a way of measuring the likely sampling error)
The standard error is the standard deviation of the sampling distribution of the statistic.
With normally distributed values and/or large samples, 1.96 SEs around the sample mean produce a range of values which will include the true mean with 95% confidence.
How do you calculate the standard error of the mean (SEM) and a confidence interval for the mean?
How do you calculate the SE for a difference in means and the CI for this?
Standard error of the mean = SD/ √𝑛
95% CI = sample mean ± (1.96 x SE)
See this page for formula for SE of difference in means: https://www.healthknowledge.org.uk/index.php/public-health-textbook/research-methods/1b-statistical-methods/methods-quantification-uncertainty
95% CI for difference in means = (mean1 - mean2 ) ± (1.96 x SE of difference in means)
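The linked page gives the full formulae; for independent samples the SE of a difference in means is the square root of the sum of the squared SEMs. A sketch with invented summary statistics:

```python
# Sketch: SEM, 95% CI for a mean, and (for independent samples) the SE and
# 95% CI of a difference in means, SE_diff = sqrt(SE1^2 + SE2^2).
# Summary statistics are invented.
import math

def sem(sd, n):
    return sd / math.sqrt(n)

def ci95(estimate, se):
    return estimate - 1.96 * se, estimate + 1.96 * se

mean1, sd1, n1 = 120.0, 15.0, 100
mean2, sd2, n2 = 114.0, 12.0, 90

se_diff = math.sqrt(sem(sd1, n1) ** 2 + sem(sd2, n2) ** 2)
print(ci95(mean1, sem(sd1, n1)))      # CI for a single mean
print(ci95(mean1 - mean2, se_diff))   # CI for the difference in means
```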
How do you calculate the standard error of proportion/percentage?
How do you calculate the CI for a proportion/percentage?
How do you calculate the standard error for a difference in proportions?
How do you calculate the CI for a difference in proportions?
95% CI for a proportion = proportion ± (1.96 x SE)
95% CI for a difference in proportions = (p1 - p2) ± (1.96 x SE for the difference)
See this page for SE formulae: https://www.healthknowledge.org.uk/index.php/public-health-textbook/research-methods/1b-statistical-methods/methods-quantification-uncertainty
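The linked page gives the formulae; the standard SE of a single proportion is sqrt(p(1-p)/n), and for independent samples the SE of a difference is the root of the sum of squared SEs. A sketch with invented proportions and sample sizes:

```python
# Sketch: SE and 95% CI for a proportion and for a difference in proportions,
# using SE = sqrt(p(1-p)/n) and SE_diff = sqrt(SE1^2 + SE2^2). Data invented.
import math

def se_prop(p, n):
    return math.sqrt(p * (1 - p) / n)

p1, n1 = 0.30, 200
p2, n2 = 0.20, 180

se1 = se_prop(p1, n1)
se_diff = math.sqrt(se_prop(p1, n1) ** 2 + se_prop(p2, n2) ** 2)

print(p1 - 1.96 * se1, p1 + 1.96 * se1)                        # CI for p1
print((p1 - p2) - 1.96 * se_diff, (p1 - p2) + 1.96 * se_diff)  # CI for p1 - p2
```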
What values are used for 90% and 99% CIs?
90% = 1.645
99% = 2.58
Describe interpretation/comparison of two 95% confidence intervals
95% confidence intervals do not overlap: Significant difference at the 5% significance level (i.e. strong evidence of a true difference).
95% confidence intervals overlap but the point estimates are outside the confidence intervals of the other: Unclear - Requires calculation of a significance test.
Point estimate of one sample falls within the 95% confidence intervals of the other: No significant difference at the 5% significance level (i.e. no strong evidence of a true difference).
What is the normal distribution?
Normal distribution describes continuous data which have a symmetric distribution, with a characteristic ‘bell’ shape.
Described by μ, the population mean (the centre of the distribution), and σ, the population standard deviation; the distribution is symmetric about the mean.
What are the uses of the normal distribution?
Many continuous biological measurements are approximately normally distributed, allowing reference ranges (mean ± 1.96 SD) to be derived. The sampling distributions of means and proportions are approximately normal in large samples, which underpins standard errors, 95% confidence intervals and significance tests.
What is the binomial distribution and its uses?
Constructed from sample size n and constant true probability π.
It shows the frequency of events with two possible outcomes e.g. success and failure. In this example π would be the treatment success rate.
The distribution will depict the probability of different numbers of successes.
Will approximate normal distribution for large samples.
Uses:
- Discrete data with two outcomes
- Sampling distribution of proportions
NB since a proportion or probability cannot be negative, binomial will not have negative values
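A sketch of the binomial probabilities for a hypothetical example, n = 10 patients with treatment success rate π = 0.3:

```python
# Sketch: binomial probabilities for n trials with success probability pi,
# e.g. the number of treatment successes out of 10 patients with pi = 0.3.
from math import comb

def binom_pmf(k, n, pi):
    return comb(n, k) * pi ** k * (1 - pi) ** (n - k)

n, pi = 10, 0.3
probs = [binom_pmf(k, n, pi) for k in range(n + 1)]

print(round(sum(probs), 10))                       # probabilities sum to 1
print(max(range(n + 1), key=lambda k: probs[k]))   # most likely count: 3
```

Only counts 0 to n carry probability, which is why the distribution cannot take negative values.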