Mean of a data set
The mean of a data set is the sum of the observations divided by the number of observations.
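As a small illustration of this definition, here is a minimal Python sketch. The function name `mean` and the data values are made up for the example.

```python
def mean(observations):
    """Return the sum of the observations divided by the number of observations."""
    if not observations:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(observations) / len(observations)


data = [2, 4, 4, 5, 10]  # made-up observations
print(mean(data))  # 5.0, since 25 / 5 = 5
```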
Measures of central tendency or measures of center
Descriptive measures that indicate where the center or most typical value of a data set lies are called measures of central tendency, or measures of center.
Three important measures of center: mean, median, mode.
Median of a data set
Essentially, it is the number that divides the bottom 50% of the data from the top 50%.
Arrange the data in increasing order.
• If the number of observations is odd, then the median is the observation exactly in the middle of the ordered list.
• If the number of observations is even, then the median is the mean of the two middle observations in the ordered list.
In both cases, if we let n denote the number of observations, then the median is at position (n + 1)/2 in the ordered list. When n is even, this position falls halfway between the two middle observations.
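The two cases above can be sketched in Python as follows. The function name `median` and the example values are illustrative only.

```python
def median(observations):
    """Return the median: the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(observations)  # arrange the data in increasing order
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: the observation exactly in the middle
        return ordered[mid]
    # even n: the mean of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))     # 5
print(median([7, 1, 5, 3]))  # 4.0
```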
Mode of a data set
Find the frequency of each value in the data set.
• If no value occurs more than once, then the data set has no mode.
• Otherwise, any value that occurs with the greatest frequency is a mode of the data set.
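A possible Python sketch of this rule is shown below. The name `modes` is hypothetical; it returns a list because a data set can have more than one mode, and an empty list when there is no mode.

```python
from collections import Counter


def modes(observations):
    """Return every value that occurs with the greatest frequency, or [] if there is no mode."""
    counts = Counter(observations)  # frequency of each value
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # no value occurs more than once, so the data set has no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 3, 3]))  # [2, 3]
print(modes([1, 2, 3]))        # []
```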
Resistant measure and trimmed mean
A resistant measure is not sensitive to the influence of a few extreme observations.
The median is a resistant measure of center, but the mean is not.
A trimmed mean can improve the resistance of the mean: removing a percentage of the smallest and largest observations before computing the mean gives a trimmed mean.
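The sketch below illustrates the idea in Python. It assumes that the given percentage is removed from each end of the ordered data and that the number of removed observations is rounded down; the notes do not fix either convention, and the name `trimmed_mean` is made up for the example.

```python
def trimmed_mean(observations, percent):
    """Drop `percent` % of the observations from each end, then average the rest.

    Assumption: the count removed from each end is rounded down.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be at least 0 and less than 50")
    if not observations:
        raise ValueError("the data set is empty")
    ordered = sorted(observations)
    k = int(len(ordered) * percent / 100)  # observations removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 100]       # made-up data with one extreme observation
print(trimmed_mean(data, 0))   # 22.0, the ordinary mean
print(trimmed_mean(data, 20))  # 3.0, after dropping 1 and 100
```

The extreme value 100 pulls the ordinary mean up to 22, while the 20% trimmed mean stays at 3, which is also the median of this data set.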
Summation notation
In statistics, as in algebra, letters such as x, y, and z are used to denote variables.
We can often use notation for variables, along with other mathematical notations, to express statistics definitions and formulas concisely.
The sample mean
The values of a variable for a sample from a population are called sample data. The mean of sample data is called a sample mean.
For a variable x, the mean of the observations for a sample is called a sample mean and is denoted x̄ (an x with a bar on top), read as "x bar".
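Written in summation notation, the sample mean can be expressed as below. The symbols x_i (the i-th observation) and n (the number of observations in the sample) are assumed here for illustration.

```latex
\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
```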
Measures of variation or measures of spread
Measures of variation describe differences quantitatively: they indicate the amount of variation, or spread, in a data set.
Two frequently used measures of variation are the range and the sample standard deviation.
Range of a data set
The range of a data set is given by the formula
Range = Max - Min,
where Max and Min denote the maximum and minimum observations, respectively.
The range takes into account only the largest and smallest observations. For that reason, two other measures of variation are usually favoured over it: the standard deviation and the interquartile range.
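A short Python sketch of the range formula follows. The function name `data_range` and both data sets are made up; the second data set shows how a single extreme observation changes the range.

```python
def data_range(observations):
    """Return Max - Min for the data set."""
    if not observations:
        raise ValueError("the range of an empty data set is undefined")
    return max(observations) - min(observations)


print(data_range([3, 8, 1, 6]))      # 7
print(data_range([3, 8, 1, 6, 50]))  # 49, driven entirely by the one extreme value
```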
The sample standard deviation
Takes into account all the observations.
It is the preferred measure of variation when the mean is used as the measure of center.
Steps to computing standard deviation
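The notes do not list the steps here. The sketch below assumes the usual textbook definition of the sample standard deviation (sum of squared deviations from the sample mean, divided by n - 1, then square-rooted); the function name and data are hypothetical.

```python
import math
import statistics


def sample_standard_deviation(observations):
    """Sample standard deviation, assuming the usual n - 1 divisor."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_bar = sum(observations) / n                               # sample mean
    squared_deviations = [(x - x_bar) ** 2 for x in observations]
    sample_variance = sum(squared_deviations) / (n - 1)         # divide by n - 1
    return math.sqrt(sample_variance)                           # take the square root


data = [2, 4, 4, 5, 10]  # made-up observations
print(sample_standard_deviation(data))  # 3.0
print(statistics.stdev(data))           # standard-library cross-check
```

For this data the mean is 5, the squared deviations are 9, 1, 1, 0, 25, their sum is 36, and 36 / 4 = 9, whose square root is 3.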
Variation and the standard deviation
The more variation there is in a data set, the larger its standard deviation.
Three-standard-deviations rule
Almost all the observations in any data set lie within three standard deviations to either side of the mean.
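To check this rule on a particular data set, one option is to compute the fraction of observations within three standard deviations of the mean. The sketch below uses Python's standard `statistics` module; the function name and data are made up.

```python
import statistics


def fraction_within_three_sd(observations):
    """Fraction of observations within three sample standard deviations of the mean."""
    x_bar = statistics.mean(observations)
    s = statistics.stdev(observations)  # sample standard deviation
    lower, upper = x_bar - 3 * s, x_bar + 3 * s
    inside = [x for x in observations if lower <= x <= upper]
    return len(inside) / len(observations)


print(fraction_within_three_sd([2, 4, 4, 5, 10]))  # 1.0
print(fraction_within_three_sd([0] * 19 + [100]))  # 0.95
```

In the second data set only the single extreme observation, 100, falls outside the three-standard-deviation band.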
Quartiles
Certain percentiles are particularly important:
• the 10th, 20th, ..., 90th percentiles are called the deciles and divide a data set into tenths (10 equal parts);
• the 20th, 40th, 60th, and 80th percentiles are called the quintiles and divide a data set into fifths (five equal parts);
• the most commonly used percentiles other than the median are the quartiles, which are the 25th, 50th, and 75th percentiles and divide a data set into quarters (four equal parts).
Because of their importance, we use a special notation for the three quartiles, namely, Q1, Q2, and Q3.
Hence, roughly speaking,
• the first quartile, Q1, is the number that divides the bottom 25% of the data from the top 75%;
• the second quartile, Q2, is the median, which, as you know, is the number that divides the bottom 50% of the data from the top 50%; and
• the third quartile, Q3, is the number that divides the bottom 75% of the data from the top 25%.
Simpler explanation of Quartiles
To determine the Quartiles
Step 1 Arrange the data in increasing order.
Step 2 Find the median of the entire data set. This value is the second quartile, Q2.
Step 3 Divide the ordered data set into two halves, a bottom half and a top half; if the number of observations is odd, include the median in both halves.
Step 4 Find the median of the bottom half of the data set. This value is the first quartile, Q1.
Step 5 Find the median of the top half of the data set. This value is the third quartile, Q3.
Step 6 Summarize the results.
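Steps 1 to 5 can be sketched in Python as follows. The function names and data are illustrative. Statistical software often uses other quartile definitions, so its results may differ slightly from this procedure.

```python
def median_of_ordered(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(observations):
    """Return (Q1, Q2, Q3) using the median-of-halves procedure."""
    ordered = sorted(observations)          # Step 1
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two observations are needed")
    q2 = median_of_ordered(ordered)         # Step 2
    half = n // 2                           # Step 3
    if n % 2 == 1:
        bottom = ordered[:half + 1]         # odd n: the median belongs to both halves
    else:
        bottom = ordered[:half]
    top = ordered[half:]
    q1 = median_of_ordered(bottom)          # Step 4
    q3 = median_of_ordered(top)             # Step 5
    return q1, q2, q3


print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 4.5, 6.5)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2.5, 4, 5.5)
```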
The interquartile range
The interquartile range, or IQR, is the difference between the first and third quartiles; that is, IQR = Q3 - Q1.
The five-number summary
The five-number summary of a data set is Min, Q1, Q2, Q3, Max.
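A minimal Python sketch of the five-number summary, with the IQR computed from it, might look like this. It assumes quartiles are found as medians of the bottom and top halves, with the median included in both halves when the number of observations is odd; the names and data are made up.

```python
def _median(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(observations):
    """Return (Min, Q1, Q2, Q3, Max)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two observations are needed")
    half = n // 2
    bottom = ordered[:half + n % 2]  # includes the median when n is odd
    top = ordered[half:]
    return ordered[0], _median(bottom), _median(ordered), _median(top), ordered[-1]


data = [1, 2, 3, 4, 5, 6, 7, 30]  # made-up observations
minimum, q1, q2, q3, maximum = five_number_summary(data)
print((minimum, q1, q2, q3, maximum))  # (1, 2.5, 4.5, 6.5, 30)
print(q3 - q1)                         # IQR = 4.0
```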
Outliers
In data analysis, the identification of outliers (observations that fall well outside the overall pattern of the data) is important.
An outlier requires special attention. It may be the result of a measurement or recording error, an observation from a different population, or an unusual extreme observation.
Lower and upper limits (or fences)
The lower limit and upper limit of a data set are
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 + 1.5 * IQR
Observations that lie below the lower limit or above the upper limit are potential outliers.
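The limits and the outlier check can be sketched in Python as below. The function names are hypothetical, and the quartiles Q1 = 2.5 and Q3 = 6.5 are taken as given for the made-up data set.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def potential_outliers(observations, q1, q3):
    """Observations below the lower limit or above the upper limit."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in observations if x < lower or x > upper]


data = [1, 2, 3, 4, 5, 6, 7, 30]            # made-up observations
print(outlier_limits(2.5, 6.5))             # (-3.5, 12.5)
print(potential_outliers(data, 2.5, 6.5))   # [30]
```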
Box plots
A boxplot, also called a box-and-whisker diagram, is based on the five-number summary and can be used to provide a graphical display of the center and variation of a data set.
To construct a boxplot, we also need the concept of adjacent values. The adjacent values of a data set are the most extreme observations that still lie within the lower and upper limits; they are the most extreme observations that are not potential outliers.
To construct a box plot
Step 1 Determine the quartiles.
Step 2 Determine potential outliers and the adjacent values.
Step 3 Draw a horizontal axis on which the numbers obtained in Steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
Step 4 Connect the quartiles to make a box, and then connect the box to the adjacent values with lines.
Step 5 Plot each potential outlier with an asterisk.
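The numbers needed for Step 2 can be computed with a sketch like the one below; the drawing itself is left out. The function name is hypothetical, and the limits -3.5 and 12.5 are assumed to have been computed already for the made-up data.

```python
def adjacent_values(observations, lower_limit, upper_limit):
    """Most extreme observations that still lie within the lower and upper limits."""
    inside = [x for x in observations if lower_limit <= x <= upper_limit]
    if not inside:
        raise ValueError("no observations lie within the limits")
    return min(inside), max(inside)


data = [1, 2, 3, 4, 5, 6, 7, 30]         # made-up observations
print(adjacent_values(data, -3.5, 12.5))  # (1, 7)
```

Here the whiskers would extend from the box to 1 and 7, and the potential outlier 30 would be plotted separately.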
Population mean (mean of a variable)
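To find a sample mean, we first sum the observations of the variable for the sample, and then we divide by the size of the sample.
We can find the mean of a finite population similarly: sum all possible observations of the variable for the population, and then divide by the size of the population. The population mean is usually denoted μ (mu).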
Note: For a particular variable on a particular population:
• There is only one population mean, namely, the mean of all possible observations of the variable for the entire population.
• There are many sample means, one for each possible sample of the population.
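The contrast in the note above can be seen on a tiny made-up population. The sketch below lists every sample of size 2 (drawn without replacement, an assumption of this example) and its sample mean.

```python
from itertools import combinations

population = [2, 4, 6, 8]  # made-up finite population

# One population mean: all observations divided by the population size.
population_mean = sum(population) / len(population)

# Many sample means: one for each possible sample of size 2.
sample_means = [sum(sample) / len(sample) for sample in combinations(population, 2)]

print(population_mean)  # 5.0
print(sample_means)     # [3.0, 4.0, 5.0, 5.0, 6.0, 7.0]
```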
Parameter and statistic
Parameter: A descriptive measure for a population
Statistic: A descriptive measure for a sample