What does a simple frequency table show? What is bad about it? What could be used instead? What type of data is it used for?
It shows how discrete data values compare to each other and to the entire sample. There is a column for the categories, then a column for the frequency that the category appears (how many people are in it) and then the percent, which is the frequency of that category divided by the total number of people.
It doesn’t let you compare the numbers visually in an easy-to-understand way. A pie chart or a bar chart displays the same information more clearly.
This is representing nominal or ordinal (categorical) data.
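The table described above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the pet data are made up for the example.

```python
from collections import Counter

def frequency_table(responses):
    """Return rows of (category, frequency, percent), most frequent first."""
    counts = Counter(responses)
    total = len(responses)
    # percent = frequency of the category / total number of people * 100
    return [(category, count, 100 * count / total)
            for category, count in counts.most_common()]

# Hypothetical sample: favourite pet reported by 10 people.
pets = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat", "dog", "fish"]
table = frequency_table(pets)

for category, count, percent in table:
    print(f"{category:<6}{count:>4}{percent:>7.1f}%")
```

The percent column adds up to 100 here because every person falls into exactly one category.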
What does a bar chart show? What do gaps between the bars indicate?
A bar chart is used to visually show the frequencies of different categories. Gaps between the bars indicate that the data is discrete, meaning people have to be in one category or another; they can’t be in both, in neither, or between categories. You can use this representation even when people can choose multiple options; the percentages just won’t add up to 100.
This uses frequencies as the y-axis!
What does a pie chart show? When is it specifically used?
A pie chart shows how a whole is divided among the different categories, so the slices must add up to 100%. Each section shows its percentage, and the categories are labelled at the bottom.
So this uses percentages not frequencies!
A pie chart is used when you are trying to emphasize relative proportions.
What does a frequency table with cumulative percent in it show?
This type of frequency table is used for ordinal/discrete data that is only made up of integers. Once the values are ordered, you take the percent of the total for a value and add it to the percents of all the previous values, and this gives the cumulative percent. It only makes sense to add this column when the data have an inherent order.
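As a rough sketch of the cumulative percent column, the snippet below builds it for some made-up ratings on a 1 to 5 scale. The function name `cumulative_table` and the data are assumptions for illustration only.

```python
from collections import Counter
from itertools import accumulate

def cumulative_table(scores):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(scores)
    total = len(scores)
    values = sorted(counts)  # cumulative percent needs the values in order
    running_totals = accumulate(counts[v] for v in values)
    return [
        (v, counts[v], 100 * counts[v] / total, 100 * running / total)
        for v, running in zip(values, running_totals)
    ]

# Hypothetical ratings from 20 people on a 1-5 scale.
ratings = [1] * 2 + [2] * 4 + [3] * 8 + [4] * 4 + [5] * 2
rows = cumulative_table(ratings)

for value, freq, pct, cum_pct in rows:
    print(f"{value}  {freq:>2}  {pct:>5.1f}%  {cum_pct:>5.1f}%")
```

The last row always reaches 100%, because by then every value has been counted.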
What does a bar chart with spaces and breaks between bars indicate for ordinal data?
For ordinal data, this indicates that the values are discrete (you cannot occupy a value between the numbers so participants could only choose those whole numbers). The spaces then mean that no participants chose that number. So you can technically think of discrete data as categories as well.
For continuous data, what do you do for the categories in a frequency table?
You would make the categories ranges of values that are of equal size and equally spaced, and then you would count the number of values within each range. However, this loses some information because you no longer know the individual data points.
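One possible way to group continuous values into equal-width ranges is sketched below. The function name, the starting point of 150, the class width of 10 and the height data are all assumptions chosen for the example. Each class includes its lower limit and excludes its upper limit.

```python
def grouped_frequencies(values, start, width, n_classes):
    """Count values in n_classes equal-width classes [low, high)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical heights in cm.
heights = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 170.0, 174.5, 181.3]
classes = grouped_frequencies(heights, start=150, width=10, n_classes=4)

for low, high, count in classes:
    print(f"{low}-{high}: {count}")
```

The output only says that four heights fall between 160 and 170, not what they were, which is the information lost by grouping.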
What is a histogram and what is it used for?
A histogram is basically a bar plot for continuous data that has been placed in ranges which are thought of as categories. It shows the frequencies of these classes by having frequency on the y-axis and the ranges (classes) on the x-axis. Because the data is continuous and values can fall anywhere along the x-axis, the bars are connected. This basically shows the distribution of the data without individually plotting each point. The bars must all be the same width, since each class covers the same range. REDUCES TOTAL AMOUNT OF INFO BUT INCREASES UNDERSTANDING OF DISTRIBUTION OF SCORES.
What should a normal distribution look like and what does a skewed distribution look like?
A normal distribution allows for parametric testing because it follows the distribution you would expect for a large sample size (few values at the extremes, many in the middle where the most common value is). A skewed distribution has a long tail on one side, meaning there are more extreme values on that side than on the other. Essentially the tail points in one direction and the peak of the curve is shifted in the other direction.
Where is the mean pulled to in skewed data?
The mean will always be pulled towards the tail because the outliers (extreme values) there carry a lot of weight in the average and shift it towards them.
Where is the mean pulled to for a negative skew? What about a positive skew?
For a negative skew, the tail points in the negative x direction (left), meaning that there are a lot of small outliers and the peak of the curve is shifted to the right. The mean is the sum of all the values divided by the number of values, so every value gets equal weight and the values furthest from the rest pull the mean towards them. So the mean is shifted to the left (negative) side.
For a positive skew, the tail points in the positive x direction, so the outliers are more on that side and the peak is more to the left. The mean is also shifted towards the tail because the outliers pull on it more than they would in a normal curve.
For a normal distribution, what part of the curve is the mean, which is the median, and which is the mode? The normal curve has the value on the x-axis (category) and the amount of participants (frequency) in that category on the y-axis.
The peak of the curve is the mode, because that is the value the most people are choosing. The mean is also that value for a normal distribution: the curve is symmetric, so the deviations on either side cancel out and leave the mean in the middle. The median is also at the peak, as half of the ordered values lie on each side of it.
For a positive skew, where are the mean, median and mode? Why?
For a positive skew, the mean is going to be shifted the furthest right towards the tail, because a lot of values are present on that tail and therefore have a large effect. The median will also be slightly shifted towards the tail because there are more values in that tail now, so the middle value is going to be closer to the tail than right at the peak. The mode will always be the peak because it is still just the value which is the most common among participant choices.
For a negative skew, where are the mean, median and mode?
The mean is going to be the furthest left because there are more outliers that way and so the average value is going to be smaller than expected. The median will then be to the left slightly as well but not as much, as it splits the data in half and thus the halfway point is not at the peak due to the skew. The mode will then be the peak once again.
What is the general rule for where the mean and median will go upon data skewing and why?
The mean will generally go the furthest towards the tail. Think of it as a seesaw: the further out you are from the middle, the more your weight counts. The more of an outlier a data point is, the more it affects the mean, because the mean is the sum of all the actual values divided by the number of values. Every value gets an equal share, so values close to the middle largely cancel each other out, while a value far from the rest does not. If a value is much lower than the rest, it drags the average down dramatically.
The median simply splits the data in half. If there are more outliers on one side, the median is still the middle value; it is just no longer at the peak, because the extra values in the tail move the halfway point towards them. It is only indirectly affected by the outliers, so it is not as strongly affected as the mean is. You can also think of the median as splitting the area under the curve in half. The short tail needs a longer distance to cover the same area as the tall peak, which is why the median is shifted away from the peak but stays closer to the peak than to the tail.
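A small numerical check of this rule, using made-up scores: the first list has a long right tail (positive skew), and the second is its mirror image (negative skew). The data are hypothetical and chosen so that the numbers are easy to verify by hand.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed scores: most are low, with a long right tail.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
print(mode(scores), median(scores), mean(scores))  # mode < median < mean

# Mirror image of the same scores: a negative skew with a long left tail.
mirrored = [20 - s for s in scores]
print(mean(mirrored), median(mirrored), mode(mirrored))  # mean < median < mode

# Making the outlier more extreme moves the mean but leaves the median alone.
more_extreme = scores[:-1] + [190]
print(median(more_extreme), mean(more_extreme))
```

In both lists the mode stays at the peak, the median moves a little towards the tail, and the mean moves the furthest.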
What is the mode for categorical and continuous data?
For categorical data, it is the category that is chosen the most often. For continuous data, it is the class with the highest frequency, so not necessarily one value. This will also be the peak of the curve, as this has the largest frequency of participants choosing it. If there are multiple peaks, the distribution is multimodal, meaning more than one class shares the highest frequency.
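As an illustration, the sketch below finds the mode for a made-up categorical variable and the modal class for some made-up grouped data. It assumes Python 3.8 or later, where `statistics.multimode` is available; the colour data and class counts are hypothetical.

```python
from statistics import multimode

# Categorical data: two colours tie for most common, so there are two modes.
colours = ["red", "blue", "red", "green", "blue"]
modes = multimode(colours)
print(modes)

# Grouped continuous data: the mode is the class with the highest frequency.
class_counts = {(150, 160): 2, (160, 170): 4, (170, 180): 2}
modal_class = max(class_counts, key=class_counts.get)
print(modal_class)
```

Note that the modal class is a range of values, not a single value.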
What are measures of variability?
These measure how well an average represents the sample: how much do the values it summarizes vary around that average? They essentially show how spread out a data set is.
What are the 4 measures of variability and which is each one reported with (median or mean)?
One is reported with the median and the rest are reported with the mean.
When we talk about a range in statistics what is it referring to? How is it calculated?
It refers to how widely the values are spread. For example, a range of 5 means there are 5 units between the largest and smallest values. We are not usually talking about the range as an interval such as 4-10 (which names the lowest and highest values).
Range is calculated by doing:
(largest value) - (smallest value)
So this is the length of the smallest interval that contains all the data.
What is the issue with range?
Range is sensitive to sample size: a small sample is less likely to include the extreme values (outliers) that matter to the data set, so it tends to show a smaller range.
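The range calculation, and the way a smaller sample can miss the extreme values, can be sketched like this. The scores are made up, and taking the first four values simply stands in for drawing a smaller sample.

```python
def value_range(values):
    """Range = (largest value) - (smallest value)."""
    return max(values) - min(values)

# Hypothetical scores from 8 participants.
scores = [12, 15, 14, 13, 30, 14, 2, 15]

full_range = value_range(scores)       # uses all 8 scores: 30 - 2
small_range = value_range(scores[:4])  # uses only the first 4: 15 - 12
print(full_range, small_range)
```

The smaller sample happens to contain neither the 30 nor the 2, so its range is much narrower than the range of the full data set.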
What is the interquartile range? What are quartiles?
Quartiles are the values that split the ordered data into 4 equal groups: the bottom 25% of the data, the two middle quarters, and the top 25%. Q1 is the 25th percentile. Q2 is the 50th percentile: 50% of scores fall below it and 50% fall above. Q3 is the 75th percentile.
The interquartile range (IQR) is the distance from the 25th percentile (Q1) to the 75th percentile (Q3). It covers the inner half of the data and thus describes the general values in the data while excluding outliers. This provides the range of typical values rather than the range of all values.
What measure of central tendency is Q2?
It is the median (the middle value in the data).
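A quick sketch of the quartiles and the IQR for some made-up scores is below. It uses `statistics.quantiles` with `method="inclusive"`, which is only one of several conventions for locating quartiles; textbooks and software can give slightly different values for the same data.

```python
from statistics import median, quantiles

# Hypothetical scores from 9 participants.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 12]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3, iqr)
print(q2 == median(scores))  # Q2 is the median
```

For these scores the middle half of the data lies between 4 and 8, so the IQR is 4.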
What does the IQR allow us to do?
The IQR allows us to calculate boundaries in the data; anything outside those boundaries is an outlier.
How do you calculate outliers using the IQR?
Outliers are extreme values that do not fit within the data. They are calculated using IQR with the following formulas:
The upper boundary is Q3 plus 1.5 × IQR. It does not represent a point in the data set but is instead the dividing line between outliers and non-outliers.
The lower boundary is Q1 minus 1.5 × IQR.
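The two boundary formulas can be tried out on made-up data as follows. The function name `iqr_boundaries` is hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is an assumption; other quartile conventions can shift the boundaries slightly.

```python
from statistics import quantiles

def iqr_boundaries(values):
    """Return (lower boundary, upper boundary) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical scores with one extreme value.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 25]
lower, upper = iqr_boundaries(scores)
outliers = [s for s in scores if s < lower or s > upper]

print(lower, upper, outliers)
```

Here Q1 is 4 and Q3 is 8, so the boundaries are -2 and 14, and only the score of 25 falls outside them. Neither boundary is itself a value in the data set.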
What is a boxplot and what do the different parts of it represent?
A boxplot is a way to represent the data such that the majority of the data is shown and the outliers are distinguished from it.
The box represents the IQR, so the top of the box is Q3 and the bottom is Q1, thus it is showing the most typical values. The line in the box plot is the median (middle value of the data set). These all correspond to actual values, because the y-axis is displaying the numerical variable. The x-axis then indicates which variable the box is representing.
The two whiskers indicate the spread of the data outside of the IQR, so they cover the bottom 25% and the top 25% of the data. The top whisker ends at the largest value below the upper fence, and the bottom whisker ends at the smallest value above the lower fence.
The fences (the upper and lower boundaries calculated from the IQR) are not actual values; they are just thresholds that indicate where the whiskers should end and what counts as an outlier.
So the box = 50% of the data (IQR).
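To tie the parts of the boxplot together, the sketch below computes the numbers a boxplot would draw for some made-up scores. The function name `boxplot_summary` is hypothetical. It assumes the "inclusive" quartile method from the `statistics` module and treats a value lying exactly on a fence as inside it; plotting software may differ on both points.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Return the box edges, median, whisker ends and outliers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,                      # bottom of the box
        "median": med,                 # line inside the box
        "q3": q3,                      # top of the box
        "whisker_low": min(inside),    # smallest value above the lower fence
        "whisker_high": max(inside),   # largest value below the upper fence
        "outliers": sorted(v for v in values
                           if v < lower_fence or v > upper_fence),
    }

# Hypothetical scores with one extreme value.
scores = [10, 12, 13, 14, 15, 16, 17, 18, 40]
summary = boxplot_summary(scores)
print(summary)
```

The whiskers end at 10 and 18, which are real scores, while the fences (7 and 23) are only thresholds; the score of 40 lies beyond the upper fence and would be drawn as a separate point.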