Why summarise data
Summarising data using tables
Summarising data using graphs
What is a distribution?
A description of the data you have for one variable: which values occur and how often.
Properties of distributions
* What the central tendency is (mean, median or mode).
* How symmetrical the data is either side of the mean (skew).
* How variable the data is (e.g. data range, standard deviation and kurtosis).
* If it’s a “normal distribution”.
Central tendency (the average)
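As a minimal sketch, the three measures of central tendency can be computed with Python's standard `statistics` module. The `scores` list is made-up data, used only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical scores, invented for illustration
scores = [2, 3, 3, 4, 5, 6, 12]

average = mean(scores)       # sum of the values divided by how many there are
middle = median(scores)      # middle value once the data is sorted
most_common = mode(scores)   # value that occurs most often

print(average, middle, most_common)
```

In this example the single extreme score of 12 pulls the mean (5) above the median (4), while the mode (3) is unaffected.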
Skew (symmetry of distribution)
Positive skew: the longer tail points to the right (towards higher values).
Negative skew: the longer tail points to the left (towards lower values).
A normal distribution is symmetrical (no skew).
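One way to put a number on skew is the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5). The sketch below assumes that definition; the function name `skewness` and the data are hypothetical, and other software may apply small-sample corrections that give slightly different values.

```python
from statistics import fmean

def skewness(data):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])  # second central moment
    m3 = fmean([(x - m) ** 3 for x in data])  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data, invented for illustration
right_tailed = [1, 2, 2, 3, 3, 3, 10]        # long tail towards high values
left_tailed = [-x for x in right_tailed]     # mirror image: tail towards low values
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
print(skewness(symmetric))     # zero
```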
Kurtosis
Positive kurtosis
Leptokurtic: centre very high
Negative kurtosis
Platykurtic: centre very flat
Normal distribution
Mesokurtic: normal bell curve
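Kurtosis can be quantified in a similar way. The sketch below assumes the common "excess kurtosis" convention (fourth central moment divided by the squared second central moment, minus 3), under which a normal distribution scores 0. The function name and data are hypothetical.

```python
from statistics import fmean

def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal curve)."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])  # second central moment
    m4 = fmean([(x - m) ** 4 for x in data])  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data, invented for illustration
peaked = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]  # most values piled up in the centre
flat = list(range(1, 11))                 # values spread evenly

print(excess_kurtosis(peaked))  # positive: leptokurtic
print(excess_kurtosis(flat))    # negative: platykurtic
```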
Normal distribution
Symmetrical bell curve where the mean, mode and median coincide (in real data they are approximately equal).
Variability
How spread out a set of data is.
Range
The range of a variable is the biggest value minus the smallest value. Vulnerable to extreme scores.
Interquartile range
The interquartile range (IQR) is like the range, but instead of the difference between the biggest and smallest values, it is the difference between the 75th percentile and the 25th percentile. This makes it much less sensitive to extreme scores. Used a lot.
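The sketch below compares the range and the IQR on made-up scores, with and without one extreme value. It assumes the default quartile method of `statistics.quantiles`; other software may use a different percentile convention and return slightly different quartiles. The helper names are hypothetical.

```python
from statistics import quantiles

def value_range(data):
    """Biggest value minus smallest value."""
    return max(data) - min(data)

def iqr(data):
    """75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
    return q3 - q1

# Hypothetical scores, invented for illustration
scores = [2, 4, 4, 5, 6, 7, 8, 9]
with_extreme = scores + [60]  # same data plus one extreme score

print(value_range(scores), value_range(with_extreme))  # range jumps from 7 to 58
print(iqr(scores), iqr(with_extreme))                  # IQR changes only slightly
```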
Mean absolute deviation
Mean absolute deviation is the mean of all of the absolute deviation scores of a data set. An absolute deviation is the absolute (unsigned) difference between a score and the mean. Used sometimes.
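A minimal sketch of the calculation, using made-up scores and a hypothetical helper name:

```python
from statistics import fmean

def mean_absolute_deviation(data):
    """Mean of the unsigned distances between each score and the mean."""
    m = fmean(data)
    return fmean([abs(x - m) for x in data])

# Hypothetical scores with a mean of 5, invented for illustration
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Absolute deviations are 3, 1, 1, 1, 0, 0, 2, 4, which average to 1.5
print(mean_absolute_deviation(scores))
```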
Variance
The variance is the mean of the squared deviations from the mean (each score minus the mean, squared, then averaged). Not used much on its own, because it is in squared units.
Standard deviation
The square root of the variance. Used the most.
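As a sketch, the variance can be computed as the mean of the squared deviations from the mean, and the standard deviation as its square root. This assumes the population form (dividing by n); the sample form divides by n - 1 and is available as `statistics.variance` and `statistics.stdev`. The data is made up.

```python
from statistics import fmean, pvariance, pstdev

# Hypothetical scores with a mean of 5, invented for illustration
scores = [2, 4, 4, 4, 5, 5, 7, 9]

m = fmean(scores)
squared_deviations = [(x - m) ** 2 for x in scores]
variance_by_hand = fmean(squared_deviations)  # mean of squared deviations
sd_by_hand = variance_by_hand ** 0.5          # square root of the variance

# The standard library gives the same population values
print(variance_by_hand, pvariance(scores))
print(sd_by_hand, pstdev(scores))
```

For these scores the variance is 4 and the standard deviation is 2, which is back in the original units of the scores.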
If the data are normally distributed, you should expect about 68% of the data to fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.
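These percentages can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` and assumes a perfectly normal distribution; real data will only approximate these shares. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```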
Standard score
(Raw score - mean) divided by the standard deviation.
z-score
The position of a raw score in terms of its distance from the mean, when measured in standard deviation units.
The z-score is positive if the value lies above the mean, and negative if it lies below the mean. A z-score of 1 means the value is 1 SD above the mean.
z-scores tell how an individual sits within a distribution.
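A minimal sketch of the z-score calculation. The mean of 100 and standard deviation of 15 are assumed values chosen for illustration, and the function name is hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

# Hypothetical test with mean 100 and standard deviation 15
print(z_score(130, mean=100, sd=15))  # 2 SDs above the mean
print(z_score(85, mean=100, sd=15))   # 1 SD below the mean
print(z_score(100, mean=100, sd=15))  # exactly at the mean
```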
What is a statistical model?
What is a good statistical model?
You can include so many variables that you “overfit” the model to the data, which is problematic because the model then generalises poorly to new data.
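The sketch below illustrates the idea with an extreme case, under stated assumptions. The data is made up (a true relationship y = 2x plus hand-picked noise). The "memorising" model has one parameter per observation and stands in for a model with too many variables. All names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(predictions, observed):
    """Mean squared error between predictions and observed values."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observed)) / len(observed)

# Hypothetical data: true relationship y = 2x, plus noise
xs = list(range(8))
train_noise = [1, -1, 1, -1, 1, -1, 1, -1]
new_noise = [-1, 1, 1, -1, -1, 1, 1, -1]
train_ys = [2 * x + e for x, e in zip(xs, train_noise)]
new_ys = [2 * x + e for x, e in zip(xs, new_noise)]  # a fresh sample

# Simple model: a straight line with two parameters
slope, intercept = fit_line(xs, train_ys)
line_preds = [slope * x + intercept for x in xs]

# Overfitted model: memorises every training value exactly
memorised = dict(zip(xs, train_ys))
memo_preds = [memorised[x] for x in xs]

line_train, line_new = mse(line_preds, train_ys), mse(line_preds, new_ys)
memo_train, memo_new = mse(memo_preds, train_ys), mse(memo_preds, new_ys)

print("line:     ", line_train, line_new)
print("memorised:", memo_train, memo_new)
```

The memorising model has zero error on the data it was fitted to, but a larger error than the straight line on the fresh sample: it has fitted the noise rather than the underlying relationship.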
Why do we care about error?
It tells us that we don’t fully understand our outcome/dependent variable.
It tells us there may be interesting factors at play in the relationship/data under investigation.
If we included covariates such as sex, we may have a better chance of detecting a significant effect.
Standardisation
Standardisation is the process of converting scores on different scales to a common scale. Once the standardisation is done, all the variables will have a mean of zero and a standard deviation of one, and thus the same scale.
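As a sketch, standardising amounts to converting every score to a z-score using its own variable's mean and standard deviation. The example assumes the population standard deviation (`statistics.pstdev`). The two variables are made up and the function name is hypothetical.

```python
from statistics import fmean, pstdev

def standardise(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    m = fmean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variables measured on different scales
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

z_heights = standardise(heights_cm)
z_weights = standardise(weights_kg)

print(fmean(z_heights), pstdev(z_heights))  # approximately 0 and 1
print(fmean(z_weights), pstdev(z_weights))  # approximately 0 and 1
```

After standardising, the two variables can be compared directly even though one was recorded in centimetres and the other in kilograms.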
Null hypothesis testing
The null hypothesis is the baseline against which we test our hypothesis of interest: that is, what would we expect the data to look like if there were no effect? The null hypothesis always involves some kind of equality (for example, a difference of zero), while the alternative hypothesis involves an inequality.
Importantly, null hypothesis testing operates under the assumption that the null hypothesis is true unless the evidence shows otherwise.
You should never make a decision about how to perform a hypothesis test once you have looked at the data, as this can introduce serious bias into the results.
The p value
Under the assumption that the null hypothesis is true, the p value is the probability of getting a sample at least as extreme as the one we observed.
* Is a probability.
By convention, we reject the null hypothesis if the p value is less than the chosen alpha level, most commonly 0.05.
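To make the definition concrete, here is a sketch using a coin-flip example that is not from the notes. The null hypothesis is that the coin is fair, and the made-up observation is 16 heads in 20 flips. The p value is computed exactly from the binomial distribution as the probability of a result at least as far from 10 heads as the one observed. Function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(heads, flips):
    """Probability, under a fair coin, of a result at least as extreme as `heads`."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    return sum(
        binomial_pmf(k, flips, 0.5)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )

alpha = 0.05  # conventional threshold, chosen before looking at the data
p_value = two_sided_p_value(heads=16, flips=20)
reject_null = p_value < alpha

print(round(p_value, 4), reject_null)
```

Here the p value is roughly 0.012, which is below 0.05, so under this convention the null hypothesis of a fair coin would be rejected.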
Statistical significance
The alpha value