Lesson 6 Flashcards by Zoe Antaya

What does a simple frequency table show? What is bad about it? What could be used instead? What type of data is it used for?

It shows how discrete data values compare to each other and to the entire sample. There is a column for the categories, a column for the frequency of each category (how many people are in it), and a column for the percent, which is the frequency of that category divided by the total number of people, multiplied by 100.

It doesn’t compare the numbers visually in an easy-to-understand way. It is easier to display this with a pie chart or a bar chart.

This is representing nominal or ordinal (categorical) data.
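
As a rough sketch of how such a table is built, the snippet below counts each category and converts the counts to percents. The survey answers and the function name `frequency_table` are made up for illustration.

```python
from collections import Counter

# Hypothetical survey answers (made up for illustration).
responses = ["cat", "dog", "dog", "bird", "dog", "cat", "dog", "cat", "dog", "bird"]

def frequency_table(values):
    """Return (category, frequency, percent) rows, most common category first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, 100 * n / total) for cat, n in counts.most_common()]

table = frequency_table(responses)
for category, freq, pct in table:
    # One row per category: name, frequency, percent of the whole sample.
    print(f"{category:<6} {freq:>3} {pct:>6.1f}%")
```

The percents add up to 100 here because every person falls in exactly one category.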

What does a bar chart show? What do gaps between the bars indicate?

A bar chart is used to visually show the frequencies of different categories. Gaps between the bars indicate that the data is discrete, meaning people have to be in one category or another; they can’t be in both, in neither, or between categories. You can use this representation even when people can choose multiple options; the percentages just won’t add up to 100.

This uses frequencies as the y-axis!

What does a pie chart show? When is it specifically used?

A pie chart shows how the different categories divide up a whole, so it only works when the category percentages add up to 100. The percentage is written in each section and the categories are labelled at the bottom.

So this uses percentages not frequencies!

A pie chart is used when you are trying to emphasize relative proportions.

What does a frequency table with cumulative percent in it show?

This type of frequency table is used for ordinal or discrete data made up only of integers. Once the categories are ordered, you take the percent for a category and add it to the percents of all the categories before it, which gives the cumulative percent. It only makes sense to add this column when the categories have an inherent order.
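
A minimal sketch of the cumulative percent column, assuming a hypothetical set of ratings on a 1-5 scale (the data is made up for illustration):

```python
from collections import Counter
from itertools import accumulate

# Hypothetical ratings on a 1-5 scale (made up for illustration).
ratings = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]

counts = Counter(ratings)
levels = sorted(counts)  # put the categories in their inherent order
percents = [100 * counts[v] / len(ratings) for v in levels]
cumulative = list(accumulate(percents))  # running total of the percents

for level, pct, cum in zip(levels, percents, cumulative):
    print(f"{level}  {pct:5.1f}%  {cum:5.1f}%")
```

The last cumulative percent is always 100, because by then every category has been added in.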

What does a bar chart with spaces and breaks between bars indicate for ordinal data?

For ordinal data, this indicates that the values are discrete (you cannot occupy a value between the numbers so participants could only choose those whole numbers). The spaces then mean that no participants chose that number. So you can technically think of discrete data as categories as well.

For continuous data, what do you do for the categories in a frequency table?

You would make the categories ranges of values (classes) that are of equal size and evenly spaced, and then you would count the number of values within each range. However, this loses some information because you no longer know the individual data points.
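
One way to sketch this binning, assuming made-up reaction times and a hypothetical helper called `bin_counts`. Each class includes its lower edge and excludes its upper edge, which is one common convention, not the only one.

```python
# Hypothetical reaction times in seconds (made up for illustration).
times = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.7]

def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width classes, each [lower, lower + width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)  # index of the class this value falls in
        if 0 <= i < n_bins:            # ignore values outside all classes
            counts[i] += 1
    return counts

class_counts = bin_counts(times, start=1.0, width=1.0, n_bins=5)

# Text version of the frequency table: one row per class.
for i, count in enumerate(class_counts):
    lower = 1.0 + i
    print(f"{lower:.0f}-{lower + 1:.0f}  {'#' * count}")
```

Only the class counts survive; the individual times can no longer be read back from the table.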

What is a histogram and what is it used for?

A histogram is basically a bar plot for continuous data that has been placed in ranges, which are thought of as categories. It shows the frequencies of these classes by having frequency on the y-axis and the ranges (classes) on the x-axis. Because the data is continuous and values can fall anywhere along the axis, the bars are connected. This shows the distribution of the data without individually plotting each point. The bars should all be the same width since you have the same range for each class. It reduces the total amount of information but increases understanding of the distribution of scores.

What should a normal distribution look like and what does a skewed distribution look like?

A normal distribution allows for parametric testing because it follows the distribution you would expect for a large sample size: few values out at the extremes and most values in the middle, where the most common value is. A skewed distribution has one long tail, meaning there are more extreme scores on one side than on the other. Essentially the tail points in one direction and the bulk of the curve is shifted in the other direction.

Where is the mean pulled to in skewed data?

The mean will always be pulled towards the tail because those outliers (extreme values) influence it and shift the average towards themselves.

Where is the mean pulled to for a negative skew? What about a positive skew?

For a negative skew, the tail points in the negative x direction (left), meaning that there are some unusually small values and the peak of the curve is shifted to the right. The mean is the sum of all the values divided by the number of values, so every value counts equally and the extreme values in the tail drag it towards them. So the mean is shifted to the left (negative) side.

For a positive skew, the tail points in the positive x direction (right), so the outliers are on that side and the peak is more to the left. The mean is shifted towards the tail here too, because the outliers pull it further than they would in a normal curve.

For a normal distribution, what part of the curve is the mean, which is the median, and which is the mode? The normal curve has the value on the x-axis (category) and the amount of participants (frequency) in that category on the y-axis.

The peak of the curve is going to be the mode, because that is the value which the most people are choosing. The mean is also going to be that value for a normal distribution, since the deviations on either side are equal and cancel out, leaving the mean in the middle. The median will also be at the peak, as it is the middle value based on the ordered structure of the categories at the bottom.

For a positive skew, where are the mean, median and mode? Why?

For a positive skew, the mean is shifted the furthest right, towards the tail, because the extreme values on that tail have a large effect. The median is also slightly shifted towards the tail, because there are more values in that tail now and so the middle value is closer to the tail than right at the peak. The mode is always the peak because it is still just the value which is the most common among participant choices.

For a negative skew, where are the mean, median and mode?

The mean is going to be the furthest left because there are more outliers that way and so the average value is going to be smaller than expected. The median will then be to the left slightly as well but not as much, as it splits the data in half and thus the halfway point is not at the peak due to the skew. The mode will then be the peak once again.

What is the general rule for where the mean and median will go upon data skewing and why?

The mean will generally go the furthest towards the tail. Think of it as a seesaw: the further out you are from the middle, the more your weight counts. The more of an outlier a data point is, the more it affects the mean. This is because the mean adds up all the actual values and divides by the number of values, so every value gets an equal weight. Values at similar distances on either side of the middle cancel out, but a value far out in one tail has nothing to balance it. If it is much lower than the rest, it drops the average dramatically.

The median simply splits the data in half. So if there are more outliers to one side, the median is still the middle value; it is just no longer at the peak, because the values in the tail pull the halfway point towards them. It only depends on how many values are on each side, not on how extreme they are, so it is not as strongly affected as the mean is. For the median, you can also think of it as splitting the area under the curve in half. The shorter height (tail) requires a longer distance to cover the same area as the taller height (peak), which is why the median is shifted away from the peak but not as far towards the tail as the mean.
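
A small illustration of this ordering, using made-up positively skewed scores (most scores low, a few very high). It is only one example and not a proof that the ordering holds for every skewed data set.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed scores: most are low, one is very high.
scores = [2, 3, 3, 3, 4, 4, 5, 6, 9, 21]

m_mode = mode(scores)      # most common value (the peak)
m_median = median(scores)  # middle value of the ordered data
m_mean = mean(scores)      # sum of the values divided by how many there are

print("mode:", m_mode, "median:", m_median, "mean:", m_mean)
```

Here the mode sits at the peak, the median is shifted a little towards the tail, and the mean is shifted the furthest, pulled up by the single score of 21.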

What is the mode for categorical and continuous data?

For categorical data, it is the category that is chosen the most often. For continuous data, it is the class with the highest frequency, so not necessarily one value. This will also be the peak of the curve, as it has the largest frequency of participants. If there are multiple peaks, the data is multimodal, meaning more than one class shares the highest frequency.
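
A short sketch of both cases. The colour answers and the class counts are made up for illustration, and the second part only detects ties for the highest frequency, which is the simplest reading of "multiple peaks".

```python
from statistics import mode

# Categorical data: the mode is the most frequently chosen category.
colours = ["red", "blue", "blue", "green", "blue", "red"]
most_common_colour = mode(colours)

# Binned continuous data: hypothetical class -> frequency table.
class_counts = {"1-2": 4, "2-3": 7, "3-4": 5, "4-5": 7, "5-6": 2}
top = max(class_counts.values())
# Every class that reaches the highest frequency is a modal class.
modal_classes = [c for c, n in class_counts.items() if n == top]

print(most_common_colour, modal_classes)
```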

What are measures of variability?

These measure how well an average represents the data: how much does the sample vary from that average, and thus how accurate a summary is it really? Essentially they show how spread out a data set is.

What are the 4 measures of variability and which is each one reported with (median or mean)?

1. Interquartile range
2. Deviation
3. Variance
4. Standard deviation

The interquartile range (1) is reported with the median and the rest are reported with the mean.

When we talk about a range in statistics what is it referring to? How is it calculated?

It is talking about how large the spread of the values is. For example, a range of 5 means there are 5 units between the largest and smallest values. We are not usually talking about the range as in 4-10 (listing the lowest and highest values).

Range is calculated by doing:
(largest value) - (smallest value)
So this is the length of the smallest interval that contains all the data.
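
As a quick sketch with made-up values running from 4 to 10:

```python
# Hypothetical values (made up for illustration).
values = [4, 7, 5, 10, 6, 8]

# Range as a single number: largest value minus smallest value.
value_range = max(values) - min(values)
print(value_range)
```

So the range is reported as the single number 6, not as "4-10".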

What is the issue with range?

Range is sensitive to sample size: a small sample tends to have a smaller range, because you are less likely to pick up the extreme values that are an important part of the data set.

What is the interquartile range? What are quartiles?

Quartiles are the cut points that split the ordered data into 4 groups: the bottom 25% of the data, the two middle quarters, and the top 25%. Q1 is the 25th percentile and Q3 is the 75th percentile. 50% of scores fall below Q2 and 50% fall above it.

The interquartile range is the range from the 25th percentile (Q1) up to the 75th percentile (Q3). This covers the inner half of the data and thus looks at the general values in the data while excluding outliers. It provides the range of typical values rather than the range of all values.
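
A minimal sketch using Python's `statistics.quantiles` on made-up data. Different textbooks and programs use slightly different rules for locating quartiles, so the exact values can differ from a hand calculation; this uses the function's default method.

```python
from statistics import median, quantiles

# Hypothetical scores (made up for illustration).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # width of the middle 50% of the data

print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```

Q2 comes out equal to the median of the same data, which is what the middle quartile is.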

What measure of central tendency is Q2?

It is the median (the middle value in the data).

What does the IQR allow us to do?

The IQR allows us to calculate boundaries in the data, and anything outside those boundaries is an outlier.

How do you calculate outliers using the IQR?

Outliers are extreme values that do not fit with the rest of the data. They are identified using the IQR with the following formulas:

The upper boundary is Q3 plus 1.5 x IQR. It does not represent a point in the data set but is instead a division between outliers and non-outliers.
The lower boundary is Q1 minus 1.5 x IQR.
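
The two formulas can be sketched as below. The data and the function name `iqr_fences` are made up for illustration, and the quartiles come from `statistics.quantiles` with its default method, so a hand calculation with a different quartile rule may give slightly different boundaries.

```python
from statistics import quantiles

def iqr_fences(values):
    """Return (lower boundary, upper boundary) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one extreme value (made up for illustration).
data = [12, 14, 15, 15, 16, 17, 18, 40]

lower, upper = iqr_fences(data)
# Anything outside the two boundaries is flagged as an outlier.
outliers = [v for v in data if v < lower or v > upper]

print("boundaries:", lower, upper, "outliers:", outliers)
```

Note that neither boundary is itself a value in the data set; they only mark where outliers begin.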

What is a boxplot and what do the different parts of it represent?

A boxplot is a way to represent the data such that the majority of the data is indicated and outliers are distinguished.

The box represents the IQR, so the top of the box is Q3 and the bottom is Q1, thus it is showing the most typical values. The line in the box plot is the median (middle value of the data set). These all correspond to actual values, because the y-axis is displaying the numerical variable. The x-axis then indicates which variable the box is representing.

The two whiskers indicate the spread of the data outside of the IQR. So this covers the bottom 25% and the top 25% of the data. The end of the top whisker is the largest value which is below the upper fence, and the end of the bottom whisker is the smallest value which is above the lower fence.

The fences are not actual values, they are just thresholds to indicate where the whiskers should end and what counts as an outlier.

So the box = 50% of the data (IQR).
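
To make the parts concrete, here is a rough sketch that computes the numbers a boxplot would draw, without plotting anything. The data and the function name `boxplot_parts` are made up for illustration, and plotting software may use a slightly different quartile rule.

```python
from statistics import quantiles

# Hypothetical data with one extreme value (made up for illustration).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

def boxplot_parts(values):
    """Return the values a boxplot draws, using the 1.5 x IQR fence rule."""
    q1, med, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data values still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "box": (q1, q3),
        "median": med,
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }

parts = boxplot_parts(data)
print(parts)
```

The whisker ends (1 and 10) are real data values, while the fences used to find them are not.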

How can you determine the skew of the data by looking at a box plot?

You can sketch the curve on the side and see if one tail is long. Also, if one of the whiskers is longer than the other and the black line (median) is shifted away from that whisker, the data is likely skewed in the direction of the longer whisker. If the difference is only slight, then you can't say it is strongly skewed.

What is deviation? How do you calculate it for the population and for the sample?

Deviation is the difference between each score and the mean of the data set, so it is how far each score deviates from that mean.

For the population, it is the value (x_i) minus the population mean (μ): x_i - μ. If it is negative, the value is smaller than the mean; if it is positive, the value is larger than the mean. μ ("mu") is used for the population mean and N for the total number in the population.

For the sample, it is the value (x_i) minus the sample mean (x̄): x_i - x̄. x̄ (x with a line over it) represents the sample mean and n represents the sample size.

How do you find the deviation for each data entry?

1. Find the mean score of all entries
2. Calculate the deviation of each entry from that mean and then put it in a table.

Deviation scores should always sum to ________.

Deviation scores should always sum to 0, because all the negative and positive deviations cancel out.
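
A quick check with made-up scores. The numbers are chosen so the arithmetic is exact; with messier decimals, floating-point rounding can leave a sum that is extremely close to zero rather than exactly zero.

```python
# Hypothetical sample scores (made up for illustration).
scores = [2, 4, 4, 6, 9]

sample_mean = sum(scores) / len(scores)
# Deviation of each score from the sample mean.
deviations = [x - sample_mean for x in scores]

print(deviations, sum(deviations))
```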

What is the issue with the table of deviations and what can be done to counteract this?

This table is hard to read and doesn't simply sum up what we are trying to convey. We would want to take an average of the deviations to represent this, but we can't because they add to zero. So instead we take the standard deviation, which squares the deviations to get rid of the signs, averages them, and then square roots the result to represent the typical deviation of all points.

What is the difference between sample variance and standard deviation?

The sample variance is a number used to represent the average amount of variation in a set of scores, but it is in squared units, which magnifies the values. The standard deviation is then just the square root of the sample variance, which brings it back to the original units.

How do you calculate sample variance? What about population variance?

Sample variance: find the difference of each data point from the mean and square each difference, which removes the signs and magnifies large deviations. Then sum the squares and divide by the number of individuals in the sample minus one. We subtract 1 so that we don't underestimate the variance: the larger the denominator, the smaller the variance, so making the denominator smaller makes the estimate slightly larger and corrects for that. For population variance it is all the same, but you divide by N, not n-1. This is rarely used though.

1. Find the mean of the data
2. Find the deviation of each data point from the mean
3. Square these deviations
4. Sum the squares
5. Divide by n-1 (number of people in the sample - 1) to get the sample variance!
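
The five steps can be sketched as follows, with made-up scores and the result compared against Python's `statistics.variance` (sample) and `statistics.pvariance` (population) as a sanity check.

```python
from math import isclose
from statistics import pvariance, variance

# Hypothetical sample scores (made up for illustration).
scores = [2, 4, 4, 6, 9]

mean_score = sum(scores) / len(scores)                 # 1. mean of the data
deviations = [x - mean_score for x in scores]          # 2. deviation of each point
squared = [d ** 2 for d in deviations]                 # 3. square the deviations
sum_of_squares = sum(squared)                          # 4. sum the squares
sample_variance = sum_of_squares / (len(scores) - 1)   # 5. divide by n-1

# Population version: same steps, but divide by N instead of n-1.
population_variance = sum_of_squares / len(scores)

print(sample_variance, population_variance)
```

Dividing by n-1 gives the larger of the two numbers, which is the correction described above.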

How do you calculate standard deviation? How does this calculation relate to the sample variance?

The standard deviation calculation is essentially just the undoing of the squaring in the sample variance that magnified all the values initially. So once you find the sample variance you just square root it.

What does it mean if the standard deviation is larger? What about if it is smaller? What does it not show?

If the standard deviation is larger, then the typical deviation of values from the mean is larger, and thus the data is more spread out. If it is smaller, the values sit closer to the mean. However, it does not indicate how that data is spread out (is there a tail, is it bimodal, etc.).
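
As an illustration, the two made-up data sets below have the same mean but different standard deviations. The names `tight` and `wide` are just labels for this example.

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

# Two hypothetical samples with the same mean (made up for illustration).
tight = [4, 5, 5, 5, 6]   # values close to the mean
wide = [1, 3, 5, 7, 9]    # values far from the mean

sd_tight = stdev(tight)   # sample standard deviation (n-1 in the denominator)
sd_wide = stdev(wide)

print(mean(tight), round(sd_tight, 2))
print(mean(wide), round(sd_wide, 2))
```

The mean alone cannot tell these two samples apart; the standard deviation can, although it still says nothing about the shape of the spread.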

For normally distributed data, what does SD mean?

It means that about 2/3 (roughly 68%) of the data is within +1 and -1 standard deviations of the mean.
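
This proportion can be checked with Python's `statistics.NormalDist`, which models an ideal normal curve. Real data that is only approximately normal will give a proportion close to this, not exactly it.

```python
from statistics import NormalDist

# Ideal normal curve with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve between -1 SD and +1 SD of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(f"{within_one_sd:.1%} of values lie within 1 SD of the mean")
```

The result is about 68%, which is where the "2/3" rule of thumb comes from.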

How is the deviation different from the IQR and box plots?

The mean is always reported with the standard deviation, so it shows the spread around the mean rather than around the median (as is done with the IQR and box plots).

(35 cards)