Statistics Flashcards

(55 cards)

1
Q

What is survival analysis? What are the key statistics?

A

Survival analysis is analysis in which the outcome variable is time until the occurrence of an event of interest

Survival function: The probability of NOT experiencing the event of interest at least up to time t (i.e. surviving beyond t). Illustrated on a survival curve

Hazard function: Conditional probability of experiencing the event of interest at time t, having survived to that time (i.e. the instantaneous event rate)

2
Q

Describe the key ways survival analysis is done. What are the key assumptions?

A

Kaplan-Meier
- Calculates the survival function each time an event occurs and shows survival visually on a curve (see the sketch below). There is a step at each occurrence of an event or censoring
- Non-parametric
- A statistically significant difference between two groups can be assessed using the Log Rank test, but this cannot tell us the magnitude of the difference and cannot account for covariates

Cox proportional hazards regression
- Proportional hazards assumption: the relative hazard remains constant over the time period (survival curves don't cross over)
- Results give you a log hazard ratio, which can be exponentiated to the HR

Other key assumptions:
- Censoring must be non-informative (the probability of censoring must be unrelated to the probability of having the event)
- Effects of predictor variables on survival must be constant over time and multiplicatively related to the hazard (e.g. the relative hazard of men vs. women must be the same for young vs. old)
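A minimal sketch of the Kaplan-Meier product-limit calculation in Python (the data and function name are illustrative, not from any particular study):

```python
# Hand-rolled Kaplan-Meier product-limit estimator.
# `times` are follow-up times; `observed` is True for an event,
# False for a censored observation.
def kaplan_meier(times, observed):
    n_at_risk = len(times)
    surv = 1.0
    curve = []  # (time, survival probability) steps
    for t in sorted(set(times)):
        events = sum(1 for ti, ev in zip(times, observed) if ti == t and ev)
        leaving = sum(1 for ti in times if ti == t)
        if events:  # the curve only steps down at events, not censorings
            surv *= (n_at_risk - events) / n_at_risk
        curve.append((t, surv))
        n_at_risk -= leaving
    return curve

# 5 subjects; subjects 3 and 5 are censored
print(kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False]))
```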

3
Q

What is a censored observation and how do you deal with censoring issues?

A

A censored observation is one where the event in question (such as death or discharge from hospital) has not happened at the time of the analysis, and all we know is the length of time the subject has been in the study
- Right censoring: withdrawing from study, lost to follow up, end of study without experiencing outcome.
- Left censoring: event of interest occurs before the start of observation period.
- Interval censoring: Know the interval that the event of interest occurred in but not exact date

Dealing with censoring issues
- Can impute data
- Could remove censored data
- Sensitivity analysis with best and worst case scenarios for missing data

4
Q

T-tests
- When are they used and what are the different types?
- What are the assumptions of a t-test?
- What is the non-parametric alternative?

A

- Used for comparing two means
- The t-statistic is used to test for differences
- Parametric test (normally distributed data, or relying on the central limit theorem)

Types
- Single: data is collected from a single sample and compared to a pre-specified fixed value. This can be a known or hypothetical value
- Independent: comparing means between two independent samples (independent variable is binary and dependent variable is numeric)
- Paired: The same participants are used in each of the experimental conditions. Paired samples designs are used when researchers want to make inferences about population-level differences attributable to a specific intervention, experimental condition, or over time.

Assumptions:
- The data are Normally distributed
- The variances in both sample populations are roughly equal
- There are no outliers
- The dependent variable is numerical
- Observations are independent.
- Errors or deviations from the expected values are due to random chance rather than systematic biases or confounding factors.
- For paired tests: the size of the pair values is not associated with the size of the difference, and all assumptions relate to the differences between each matched pair. Bias can be introduced through order effects

Non-parametric alternative:
- Independent: Mann-Whitney U test
- Single or paired: Wilcoxon Signed Rank test
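A short sketch of the variants above using SciPy (assumes scipy is installed; the data are made up):

```python
from scipy import stats

group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 5.0, 4.4, 4.6]

# Single (one-sample): compare a sample mean to a fixed value
print(stats.ttest_1samp(group_a, popmean=5.0))

# Independent: two separate groups
print(stats.ttest_ind(group_a, group_b))

# Paired: the same subjects measured twice
print(stats.ttest_rel(group_a, group_b))

# Non-parametric alternatives
print(stats.mannwhitneyu(group_a, group_b))  # independent
print(stats.wilcoxon(group_a, group_b))      # paired
```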

5
Q

When is a z-test used? What are the assumptions?

A

Comparison of two means or two proportions (for proportions, the exposure and outcome variables are binary)

Parametric test, therefore the assumptions are:
- Normal distribution (samples must be n > 15 for this to hold)
- Binary variables (when comparing proportions)
- Independence of samples

Can be independent, single or paired
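A minimal sketch of a two-proportion z-test computed by hand (the counts are illustrative; uses the pooled proportion under the null):

```python
from math import sqrt

x1, n1 = 45, 200   # events / sample size, group 1
x2, n2 = 30, 210   # events / sample size, group 2

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(f"z = {z:.2f}")  # |z| > 1.96 corresponds to p < 0.05 (two-sided)
```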

6
Q

When is a Chi-squared test used? What are the assumptions?

A

Non-parametric test used for comparing counts of categorical responses between two or more independent groups

Data should be placed in an r x c contingency table, where r is the number of exposure groups (rows) and c is the number of possible outcomes (columns)

Assumptions:
- Independent subjects
- Large enough expected values: no more than 20% of expected cell counts should be below 5, and none below 1

A chi-squared test for trend can be calculated when the exposure variable is ordinal and the outcome is binary

7
Q

How do you calculate a chi-squared test for a 2x2 table?

A
  • Draw the r x c table
  • Calculate row and column totals
  • Calculate the expected number for each cell (row total x column total / grand total)
  • Calculate (O-E)^2/E for each cell
  • X^2 is the sum of (O-E)^2/E across all cells
  • X^2 > 3.84 corresponds to p < 0.05 (for 1 degree of freedom)
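A worked sketch of the steps above in Python (the counts are made up):

```python
# rows = exposure groups, columns = outcomes
observed = [[20, 30],   # exposed:   outcome yes / no
            [10, 40]]   # unexposed: outcome yes / no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(f"X^2 = {chi2:.2f}")  # 4.76 > 3.84, so p < 0.05 on 1 df
```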
8
Q

What is the 95% CI
(DEAD)

A

D - 95% chance that the true population parameter lies within 1.96 standard errors above or below any sample statistic. Therefore 95% CI is the range within which we can be 95% certain that the true effect lies.

E - If the test was repeated 100 times with random samples of the population, 95 of those times the true population parameter would lie within this range. 95% CI = sample statistic +/- (1.96 x SE)

A - Allows reader to assess how precise sample estimate is and assess statistical significance in absence of p value.

D - Only valid if the experiment is valid and unbiased.
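A minimal sketch of the calculation (toy data; for small samples a t multiplier would replace 1.96):

```python
from math import sqrt

data = [12.1, 11.8, 13.0, 12.4, 11.5, 12.9, 12.2, 12.6]
n = len(data)
mean = sum(data) / n
sd = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
se = sd / sqrt(n)  # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```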

9
Q

What is the normal distribution?

A
  • Bell shaped symmetrical curve described by the mean and variance. It describes the sampling distribution of a mean (continuous outcome variable).
  • Mean = median in perfect normal distribution
  • Other distributions approximate to the normal distribution when the sample size is sufficiently large
  • The standard normal distribution has a mean of 0 and variance of 1
10
Q

What is a binomial distribution?

A
  • Shows the frequency of events that have two possible outcomes (binary) - proportion or probability.
  • Has no negative values
  • As sample size increases, it approximates to normal distribution.
  • Described by sample size and true probability.
11
Q

What is the poisson distribution?

A
  • Shows the frequency of events over time, where the events occur independently of each other (count data, e.g. deaths from MI)
  • Used to analyse rates
  • No negative values
  • As sample size increases, it approximates to the normal distribution.
  • Assumes data is discrete, random and mean = variance
12
Q

Aside from normal, binomial and poisson distributions, name 3 other distributions and give one use of each?

A
  • t-distribution: estimating the mean of a normally distributed population when the sample size is small.
  • Chi-squared: Analysing categorical data
  • f-distribution: comparing two variances and more than two means using ANOVA
13
Q

What is statistical inference and what are the two main methods?

A
  • Process of making conclusions about a population based on observations in a sample. This is only valid if the sample correctly represents the population (unbiased random sample)
  • Estimations: point estimate (e.g mean, median, difference in means, proportion) or interval estimate (e.g confidence intervals)
  • Hypothesis testing: Assesses the likelihood that the finding observed in the sample is a true finding within the population and not due to chance.
14
Q

What is sampling error?

A
  • Observations in a sample are unlikely to be exactly the same as those in the true population due to sampling variability, therefore they are subject to a degree of uncertainty. This is the sampling error.
  • Measured by the standard error: estimates how precisely a population parameter is estimated by the equivalent statistic in a sample. It is the standard deviation of the sampling distribution of the statistic.
  • Standard error is the basis of calculating confidence intervals and hypothesis testing.
15
Q

What is the central limit theorum?

A

The central limit theorem states that, with a sufficiently large sample size, the sampling distribution of the mean approximates a normal distribution regardless of the shape of the underlying population distribution. This is what allows parametric tests to be used on non-normally distributed data when samples are large.
16
Q

What is conditional probability? What does it allow epidemiologists to do?

A

The probability that event B will occur given that event A has already occurred

P(B|A) = P(A and B)/ P(A)

Allows epidemiologists to evaluate how different treatments or exposures influence the probability that outcomes, such as disease or mortality, occur. Also a good way to evaluate diagnostic tests.
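A tiny worked example (the counts are invented):

```python
# Of 200 patients, 80 are smokers (A); 24 are smokers with disease (A and B)
p_a = 80 / 200
p_a_and_b = 24 / 200

p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)
print(p_b_given_a)              # 0.3: disease risk among smokers
```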

17
Q

What is the difference between a type 1 and type 2 error?

A

Errors occur if the finding in the study sample does not reflect the true finding in the population.

- Type 1 error (alpha): rejecting the null hypothesis when it is in fact true (false positive). The significance level, usually 0.05, is the accepted probability of a type 1 error
- Type 2 error (beta): failing to reject the null hypothesis when it is in fact false (false negative). Power = 1 - beta

A power calculation is done before the study to determine the appropriate sample size.

A type 2 error is considered less serious because it is only an error in that an opportunity to reject the null hypothesis was missed. It is usually due to the sample size being too small.

18
Q

How do you calculate an odds ratio?

A

Odds ratio = odds of the outcome in the exposed / odds of the outcome in the unexposed. For a 2x2 table with cells a (exposed, outcome), b (exposed, no outcome), c (unexposed, outcome) and d (unexposed, no outcome): OR = (a x d) / (b x c)
19
Q

What is probability and how is it expressed?

A

A measure of the likelihood that an event will occur.

Expressed as a positive number between 0 (event will never occur) and 1 (event is certain to occur)

20
Q

What are the rules determining how two or more probabilities can be combined?

A

Addition rule
- Used to find the probability that at least one event will occur out of 2 or more possible events
- Non mutually-exclusive events:
P(A or B) = P(A) + P(B) - P(A and B), equivalently 1 - P(neither A nor B)
- Mutually exclusive events:
P(A or B) = P(A) + P(B)

Multiplication rule
- The probability of a joint occurrence of 2 or more events.
- P(A and B) = P(A) x P(B|A)
- For independent events, event B is not affected by event A, therefore P(B|A) = P(B) and therefore the equation can be simplified to:
P(A and B) = P(A) x P(B)
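A small numeric check of the two rules, assuming two independent events:

```python
p_a, p_b = 0.3, 0.2

# Multiplication rule (independent): P(A and B) = P(A) x P(B)
p_a_and_b = p_a * p_b

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B) = 1 - P(neither)
p_a_or_b = p_a + p_b - p_a_and_b
p_neither = (1 - p_a) * (1 - p_b)
assert abs(p_a_or_b - (1 - p_neither)) < 1e-12
print(round(p_a_and_b, 2), round(p_a_or_b, 2))  # 0.06 0.44
```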

21
Q

What 4 key factors do you have to consider in a sample size calculation?

A
  • Size of difference: the effect size needed to be clinically or meaningfully significant (a smaller effect size needs a larger sample)
  • Significance level: the p-value threshold (i.e. the type 1 error rate), usually set at 0.05
  • Power: the probability that the study will be able to detect a difference that truly exists. Usually a power of 80% or more is used; higher power needs a larger sample size
  • Exposure in baseline population: a smaller prevalence requires a larger sample. In case-control studies, this is the prevalence of exposure in the controls. In cohort/intervention studies, this is the prevalence of the outcome in the unexposed population
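Not on this card, but as an illustration: a commonly used normal-approximation sample size formula for comparing two proportions, bringing the four factors together (treat the formula choice and numbers as an assumed example):

```python
from math import ceil

z_alpha = 1.96   # two-sided 5% significance level
z_beta = 0.84    # 80% power
p1, p2 = 0.20, 0.30   # assumed outcome proportions in the two groups

n_per_group = ((z_alpha + z_beta) ** 2
               * (p1 * (1 - p1) + p2 * (1 - p2))
               / (p1 - p2) ** 2)
print(ceil(n_per_group))  # ~291 participants needed in EACH group
```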
22
Q

Give some situations in which a sample size could be increased or decreased.

A

Increased:
- High LTFU
- Low response rate
- Cluster sampling
- Confounding
- Interaction

Decreased:
- Matched case controls

Increasing the sample size can only reduce the likelihood of errors caused by chance; it cannot compensate for bias.

23
Q

What is the problem of multiple comparisons and how do you account for this?

A
  • If we repeat hypothesis testing for several variables, the likelihood of a type 1 error increases (e.g for a significance level of 0.05, if you test 20 variables separately, you are likely to find that one is significant just by chance).
  • Multivariable regression can give you a result whilst controlling for all other variables; however, it cannot simply be repeated for each variable, as this reintroduces the problem of multiple comparisons.
  • Another method is Bonferroni correction:
    If n is the number of tests the author wishes to perform (e.g salt intake and 15 different types of cancer), and original significance level is 0.05, then new significance level would be:
    0.05/n
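A tiny sketch of the correction in practice (the p-values are made up):

```python
p_values = [0.001, 0.04, 0.03, 0.20]
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # 0.05 / n tests

for p in p_values:
    print(p, "significant" if p < corrected_alpha else "not significant")
# Only p = 0.001 survives the corrected threshold of 0.0125
```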
24
Q

What are the two broad types of measures in descriptive statistics that describe numerical continuous data? Give examples of each.

A

Measures of location
- Mean
- Median
- Mode

Measures of spread
- Standard deviation
- Interquartile range
- Skew
- Variance

25
Q

What is the mean? (DEAD)

A

D - The arithmetic mean is the average value based on the sum of a set of numbers (the geometric mean is based on the product)

E - mean = (sum of values)/n

A - Good for statistical analysis

D - Sensitive to outliers and poor for asymmetrical distributions
26
Q

What is the median? (DEAD)

A

D - The value at the centre of the distribution

E - Half of observations lie above and half below

A - More appropriate for skewed distributions as it is not as sensitive to outliers as the mean

D - Value determined solely by rank, so it provides no information about any other values
27
Q

What is the mode? (DEAD)

A

D - The most commonly occurring value

E - The value occurring with the highest frequency

A - Not greatly affected by outliers and can provide additional information (e.g. suicide is bimodal, affecting young and old adults)

D - Not amenable to statistical analysis. Sometimes there is no mode, or if there is more than one mode this can be difficult to interpret
28
Q

What are percentiles? (DEAD)

A

D - Values are ranked and divided into 100 groups

E - The 50th percentile is the median, the 25th is the lower quartile and the 75th is the upper quartile

A - Useful for comparing measurements (e.g. BMI in groups of similar age and sex)

D - Comparisons made at the extreme ends of the distribution are often less informative than those at the centre
29
Q

Range (DUAD, U = units)

A

D - The difference between the maximum and minimum sample values

U - Same as the data

A - Intuitive and only needs two observations

D - Sensitive to the size of the sample and to outliers
30
Q

Interquartile range (DUAD)

A

D - The middle 50% of the sample (the difference between the upper and lower quartiles)

U - Same as the data

A - More stable than the range as the sample size increases

D - Unstable for small samples and does not allow for further mathematical manipulation
31
Q

Variance (DUAD)

A

D - The average squared deviation of each number from its mean

U - Squared units of the data

A - Takes into account all values; useful for making inferences about the population

D - Units differ from the units of the data
32
Q

Standard deviation (DUAD)

A

D - The square root of the variance

U - Same as the data

A - Units are the same as in the data; useful for making inferences

D - Sensitive to some extent to extreme values and will vary depending on the units of observation
33
Q

Coefficient of variation (DUAD)

A

D - The ratio of the standard deviation to the mean, giving an idea of the size of the variation relative to the size of the observation

U - No units

A - Allows comparison of the variation of populations that have significantly different mean values

D - Where the mean value is near zero, the coefficient of variation is highly sensitive to changes in the standard deviation
34
Q

What could the causes of the statistical result be? (checklist)

A

- True association
- Chance
- Bias
- Confounding
- Reverse causation
35
Q

What is skew, and what measures describe data when skew is and is not present?

A

- Skew measures asymmetry in the distribution of data. It can be positive or negative
- Skew present: describe the data with median and IQR
- Skew not present: mean and standard deviation
36
Q

What is correlation? What is the key statistical test for correlation and how is the result interpreted? What are the assumptions for this test?

A

- Correlation is a statistical technique that measures the strength of an association. It does not tell you anything about the magnitude of the association
- Pearson's correlation coefficient is the key test. It gives you an r value ranging from -1 to +1: -1 is a very strong negative correlation, +1 is a very strong positive correlation, and 0 is no correlation
- r^2 is often given too: it is the proportion of the variance in y that is explained by x

Pearson's correlation assumptions:
- Both variables should be continuous numerical variables
- Both variables should be normally distributed
- Assumption/plausibility of a linear relationship
- Homoscedasticity: the variance of the data around the line of best fit remains constant along the line
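A short sketch with SciPy (illustrative data):

```python
from scipy import stats

x = [10, 20, 30, 40, 50, 60]
y = [12, 19, 33, 38, 52, 57]

r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}, r^2 = {r**2:.3f}")
```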
37
Q

What does linear regression tell us and when can it be used? What is the equation for simple linear regression? What are the assumptions?

A

- Commonly used in study settings where there is an assumed dose-response relationship. Linear regression tells us the magnitude of the response for any unit increase in dose
- The independent variable can be categorical or numerical, but the dependent (outcome) variable must be continuous

Equation: Y = a + bx + error
- a = intercept
- b = regression coefficient (slope)

Assumptions:
- The distribution of y for each value of x is normal
- The standard deviation of this normal distribution of y is the same for each x (homoscedasticity: errors have equal variance along the line of best fit)
- Mean values of the y distribution are linearly related to x
- An additional assumption for multiple linear regression is that there is little or no correlation between the independent variables
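A minimal sketch of fitting Y = a + bx with SciPy (assumed dose-response data):

```python
from scipy import stats

dose = [10, 20, 30, 40, 50, 60]
response = [12, 19, 33, 38, 52, 57]

fit = stats.linregress(dose, response)
print(f"intercept a = {fit.intercept:.2f}")
print(f"slope b = {fit.slope:.2f} (response per unit increase in dose)")
print(f"p = {fit.pvalue:.4f}, r^2 = {fit.rvalue ** 2:.3f}")
```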
38
Q

What are sensitivity and specificity and how are they calculated? What about positive and negative predictive values?

A

'Sense the guilty, specify the innocent'
- Sensitivity: ability of a test to detect people with true disease (%)
- Specificity: ability of a test to exclude individuals without the disease (%)
- PPV: proportion of people with a positive test who have the disease (%)
- NPV: proportion of people with a negative test who don't have the disease (%)

The values for false positives and false negatives are found using a gold standard test (the sensitivity and specificity of a gold standard test are therefore 100%). Screening tests are generally not gold standard, but are used to get the right people to a gold standard test.

When diseases are rare, PPV is not particularly good. PPV is far more dependent on specificity than sensitivity when the disease is rare. As screening usually tests for rare diseases, we usually prefer high specificity over high sensitivity to ensure a high enough PPV and minimise unnecessary gold standard testing in the test-positive group.
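A worked sketch from an assumed 2x2 table against a gold standard (1,000 people, 100 of whom have the disease):

```python
tp, fn = 90, 10     # diseased: test positive / test negative
fp, tn = 190, 710   # healthy:  test positive / test negative

sensitivity = tp / (tp + fn)   # detect people WITH the disease
specificity = tn / (tn + fp)   # exclude people WITHOUT the disease
ppv = tp / (tp + fp)           # positives who truly have the disease
npv = tn / (tn + fn)           # negatives who are truly disease free

print(round(sensitivity, 2), round(specificity, 2),
      round(ppv, 2), round(npv, 2))
# PPV is only 0.32 despite 90% sensitivity, because the disease is rare
```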
39
Q

What is the difference between odds and risk?

A

The difference is in the denominator: odds are a simple ratio of outcome vs. no outcome (A/B), whereas risk is the proportion of the total with the outcome (A/(A+B)).

Case-control studies can only use odds ratios, as risk would simply be a function of the number of cases vs. controls in the study. Cohort studies and clinical trials can use either OR or RR.
40
Q

How do you calculate an odds ratio?

A

Odds of the outcome in the exposed / odds of the outcome in the unexposed
41
Q

How do you calculate a risk ratio? (synonymous with relative risk)

A

Risk of the outcome in the exposed / risk of the outcome in the unexposed
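A small sketch contrasting the two ratios, from an assumed 2x2 table:

```python
a, b = 30, 70    # exposed:   outcome / no outcome
c, d = 15, 85    # unexposed: outcome / no outcome

risk_ratio = (a / (a + b)) / (c / (c + d))
odds_ratio = (a / b) / (c / d)   # equivalently (a*d) / (b*c)

print(f"RR = {risk_ratio:.2f}, OR = {odds_ratio:.2f}")
# RR = 2.00, OR = 2.43: the OR overstates the RR when the outcome is common
```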
42
Q

When do odds approximate to risks (probabilities)?

A

When the outcome is rare
43
Q

What is number needed to treat and how do you calculate it?

A

The number of whole patients (always round the answer UP) that must be treated with a particular intervention to prevent one outcome of interest.

NNT = 1/ARR

Where ARR = Absolute Risk Reduction (the absolute risk in the intervention group subtracted from the absolute risk in the control group).

Number Needed to Harm follows the same principle but involves calculating how many must be "treated" to cause one negative outcome.
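A minimal worked example (the risks are invented):

```python
from math import ceil

risk_control = 0.20       # 20% of controls have the outcome
risk_intervention = 0.15  # 15% of treated patients do

arr = risk_control - risk_intervention   # absolute risk reduction
nnt = ceil(1 / arr)                      # always round UP
print(nnt)  # 20: treat 20 patients to prevent one outcome
```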
44
Q

How do you calculate:
- Point prevalence
- Period prevalence
- Cumulative incidence
- Incidence rate

A

- Point prevalence: number of cases at a point in time / at-risk population x 100
- Period prevalence: total cases over a defined period (including initial cases at the start point and any new cases by the end point) / population studied x 100
- Cumulative incidence (incidence risk): number of new cases in a specified time / number of disease-free people at the beginning of the time period x 100. This is the same as risk
- Incidence rate: number of new cases in a given time / total person-time at risk during the study period. Usually x 100 and expressed as per 100 person-years
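Worked sketches of the four measures, with assumed numbers for a population of 1,000 followed for one year:

```python
population = 1000
cases_at_start = 50
new_cases = 25
person_years = 920.0   # follow-up contributed while disease free

point_prevalence = cases_at_start / population * 100
period_prevalence = (cases_at_start + new_cases) / population * 100
cumulative_incidence = new_cases / (population - cases_at_start) * 100
incidence_rate = new_cases / person_years * 100

print(point_prevalence, period_prevalence, round(cumulative_incidence, 1))
print(f"{incidence_rate:.1f} per 100 person-years")
```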
45
Q

What is the relationship between prevalence and incidence?

A

Prevalence depends on the incidence rate and the duration of disease. If the incidence of a disease is low but the duration is long, prevalence will be high relative to incidence (e.g. TB). But if incidence is high and duration is short, prevalence will be low relative to incidence.

A change in the duration of a disease, for example the development of a new treatment which prevents death but does not result in a cure, will lead to an increase in prevalence.

A population in which the numbers of people with and without the disease remain stable is known as a steady-state population. In such (theoretical) circumstances, the point prevalence of disease is approximately equal to the product of the incidence rate and the mean duration of disease (i.e. the length of time from diagnosis to recovery or death), provided that prevalence is less than about 0.1. That is:

Prevalence = Incidence x Duration
46
Q

What are the attributable risk, attributable fraction and population attributable risk?

A

- Attributable risk: the amount of risk in the exposed group that is due to the exposure.
Attributable risk = risk in exposed - risk in unexposed
- Attributable fraction: the proportion of disease in the exposed that can be considered attributable to the exposure, after allowing for the risk of disease that would have occurred anyway.
Attributable fraction = attributable risk / risk in exposed
- Population attributable risk: the excess rate of disease in the whole study population (exposed and unexposed) that is attributable to the exposure.
PAR = risk in population - risk in unexposed
- Population attributable risk fraction: the proportion of disease in the study population that is attributable to the exposure (and thus the proportion of disease that could be eliminated if the exposure were eliminated).
PARF = PAR / overall risk in population
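A sketch of all four measures with assumed risks:

```python
risk_exposed = 0.30
risk_unexposed = 0.10
risk_population = 0.18   # overall risk, exposed and unexposed combined

ar = risk_exposed - risk_unexposed        # attributable risk
af = ar / risk_exposed                    # attributable fraction
par = risk_population - risk_unexposed    # population attributable risk
parf = par / risk_population              # PAR fraction

print(round(ar, 2), round(af, 2), round(par, 2), round(parf, 2))
# 0.2 0.67 0.08 0.44
```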
47
Q

What is heterogeneity and what are the 3 main types?

A

Heterogeneity is differences between observations, populations or studies. It is a particular issue for systematic reviews/meta-analyses and may preclude the pooling of results from different studies.

3 main types:
- Statistical: differences in reported effects between studies. May be explained by methodological or clinical heterogeneity
- Clinical: differences relating to patient characteristics, interventions or outcome measures
- Methodological: due to differences in study design
48
Q

What are the two commonly used ways to test for statistical heterogeneity?

A

- Cochran's Q statistic: the null hypothesis is that the studies are the same (therefore a low p-value = high heterogeneity). The test has low power when there are few studies, so a significance level of 10% is often used
- I^2 statistic: provides a measure of the degree of heterogeneity across studies by describing the percentage of total variation across studies that is due to heterogeneity rather than chance. Rough rule of thumb: 25% = low, 50% = moderate, 75% = high
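A minimal sketch of I^2 derived from Cochran's Q (a standard identity; the Q and k values here are invented):

```python
def i_squared(q, k):
    df = k - 1
    return max(0.0, (q - df) / q) * 100  # % of variation beyond chance

print(i_squared(q=24.0, k=10))  # 62.5: moderate-to-high heterogeneity
```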
49
Q

If heterogeneity is found in a meta-analysis, what are two techniques that can be used?

A

- Random-effects meta-analysis
- Meta-regression with weighted regression estimates (if another variable is found to explain the observed heterogeneity)
50
Q

What is publication bias and how can you detect it in a meta-analysis?

A

- Publication bias occurs when the result of a study influences the likelihood of publication of that study. It happens because researchers are less likely to submit, and journals less likely to publish, studies with inconclusive or negative results. It is especially a problem for smaller studies
- A funnel plot can be used to detect it
51
Q

What are the key features of a funnel plot and what does it likely mean if it is asymmetrical at the bottom?

A

- Estimated treatment effect values (e.g. log(odds ratio), x axis) are plotted against a measure of their precision (e.g. standard error or sample size, y axis). As the sample size increases, precision increases, giving the plot its funnel shape
- Asymmetry at the bottom can be due to the small-study effect (the phenomenon of small trials tending to report larger treatment benefits than larger trials). If this exists, the summary effect measure of the meta-analysis will be overestimated. This can be caused by publication bias, but it can also arise in other ways: for example, studies targeting high-risk patients tend to be smaller, but the intervention effect can be larger
52
Q

Give examples of how to reduce publication bias

A

- Pre-registration (e.g. a register of controlled trials)
- Active discouragement of studies that do not have sufficient power to detect effects
- Publication of study protocols
- Pre-prints
53
Q

What is the role of Bayes' Theorem? What is an advantage and a disadvantage?

A

- Incorporates prior beliefs into calculations of probability (the posterior). It governs the conditional probability calculations used for dependent events
- Diagnostic test in a Bayesian framework: posterior odds of disease = prior odds x likelihood ratio of a positive test result
- Advantage: makes use of all available knowledge, therefore possibly more ethical
- Disadvantage: different users will obtain different conclusions if they choose different priors
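A sketch of the Bayesian diagnostic update quoted above (the prior probability and likelihood ratio are assumed values):

```python
prior_prob = 0.10                  # pre-test probability of disease
prior_odds = prior_prob / (1 - prior_prob)

lr_positive = 8.0                  # sensitivity / (1 - specificity)
posterior_odds = prior_odds * lr_positive
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"post-test probability = {posterior_prob:.2f}")  # ~0.47
```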
54
Q

What is McNemar's X^2 test?

A

- Used for paired binary data (e.g. repeated measures of the same variable in each participant, or an individually matched case-control study)
- Presented in a 2x2 table that shows agreement (case and control both exposed) or discordance (one exposed and one not)
- X^2 > 3.84 corresponds to p < 0.05
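A worked sketch using the standard form of the statistic, which depends only on the discordant pairs (the counts are invented):

```python
b = 25   # pairs where only the case was exposed
c = 10   # pairs where only the control was exposed

chi2 = (b - c) ** 2 / (b + c)
print(f"X^2 = {chi2:.2f}")  # 6.43 > 3.84, so p < 0.05
```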
55
Q

When would a chi-squared test for trend be used?

A

When the data have ordered (ordinal) categorical exposure variables