Distribution
A summary of the distinct values a variable takes and how often each one occurs
3 types of graphs for analyzing the distribution of values of a numerical variable
Dot Plots and how to construct
A graph where each observation is plotted as a dot above its value on a horizontal axis. Good for a small number of observations.
Step 1: Draw a horizontal axis that displays the possible values of the quantitative data. Label the axis with the variable name.
Step 2: Record each observation by placing a dot over the appropriate value on the horizontal axis.
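The two steps above can be sketched as a small text-based dot plot. This is only an illustration: the function name dot_plot and the sample scores are made up, and the column alignment assumes single-digit whole-number values.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot: one column per axis value, one dot per observation."""
    counts = Counter(values)
    # Step 1: the horizontal axis shows every whole number from min to max.
    axis = range(min(values), max(values) + 1)
    tallest = max(counts.values())
    rows = []
    # Step 2: stack one dot per observation above its value (top row first).
    for level in range(tallest, 0, -1):
        row = " ".join("•" if counts[v] >= level else " " for v in axis)
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in axis))
    return "\n".join(rows)

scores = [2, 3, 3, 4, 4, 4, 6]  # hypothetical quiz scores
print(dot_plot(scores))
```

The tallest column (here above 4) is the most frequent value, and the empty column above 5 shows a gap in the data.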
Stem Plots and how to build
A graph where each observation is separated into two parts. The leaf is the rightmost digit. The stem is everything else. For example, 57 has stem 5 and leaf 7.
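A minimal sketch of building a stem plot, assuming the observations are non-negative whole numbers. The function name stem_plot and the sample ages are hypothetical.

```python
from collections import defaultdict

def stem_plot(values):
    """Group non-negative integers by stem and list their leaves in order."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = rightmost digit, stem = everything else
        stems[stem].append(leaf)
    lines = []
    # Keep stems with no leaves so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(lines)

ages = [23, 25, 31, 31, 38, 52, 57]  # hypothetical data
print(stem_plot(ages))
# 2 | 35
# 3 | 118
# 4 |
# 5 | 27
```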
Histogram
Uses bars to represent the frequency (counts) or relative frequency of the observations falling into particular intervals (bins). Used for continuous data and large datasets.
Method of Left Inclusion
For each interval in a histogram, a square bracket on the left endpoint means that number IS included in the interval, and a round bracket on the right endpoint means that number is NOT included in the interval.
Ex) Interval A: [25.5, 27.0)
Interval B: [27.0, 28.5)
Only interval B includes the number 27
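Left inclusion can be checked with a short calculation. The sketch below assumes equal-width bins; bin_index is a hypothetical helper, with index 0 standing for interval A and index 1 for interval B.

```python
import math

def bin_index(x, start, width):
    """Index k of the left-inclusive bin [start + k*width, start + (k+1)*width) that holds x."""
    return math.floor((x - start) / width)

# Bins from the example: index 0 is [25.5, 27.0), index 1 is [27.0, 28.5)
print(bin_index(27.0, start=25.5, width=1.5))  # 1, so 27 falls in interval B
print(bin_index(26.9, start=25.5, width=1.5))  # 0, so 26.9 falls in interval A
```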
Skewness
The distribution is asymmetric. It can be right-skewed (longer tail on the right) or left-skewed (longer tail on the left).
Modality
This is the number of peaks in a distribution. A peak is a value (or bin) whose frequency is higher than that of its neighbours. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal (more than two peaks).
Uniform Distribution
A special, symmetric distribution where all possible values are observed equally with no peak.
Normal Distribution
Aka the bell curve: a unimodal, symmetric distribution
The Centre of Distribution
A typical or central value of the data, measured by the mean, median, or mode
Frequency
The number of times that a particular distinct value of a variable occurred in a sample
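As a small illustration, frequencies (and the relative frequencies used in histograms) can be counted with Python's collections.Counter. The sample values are made up.

```python
from collections import Counter

colours = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical sample

frequency = Counter(colours)  # count of each distinct value
relative = {value: count / len(colours) for value, count in frequency.items()}

print(frequency["red"])   # 3
print(relative["green"])  # 1 out of 6 observations
```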
Contingency Table
tables that summarize the information of two (bivariate) categorical variables and help answer questions related to the relationship between the two variables.
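A contingency table can be sketched by counting each pair of categories. The survey rows and variable names below are hypothetical.

```python
from collections import Counter

# Hypothetical survey rows: (year of study, preferred study method)
rows = [
    ("first", "notes"), ("first", "flashcards"), ("second", "notes"),
    ("first", "notes"), ("second", "flashcards"), ("second", "flashcards"),
]

cells = Counter(rows)  # frequency of each (row category, column category) pair
row_labels = sorted({r for r, _ in rows})
col_labels = sorted({c for _, c in rows})

# Nested dict: table[row category][column category] -> count
table = {r: {c: cells[(r, c)] for c in col_labels} for r in row_labels}

for r in row_labels:
    print(r, table[r])
```

Comparing the counts across a row or down a column is what lets the table hint at a relationship between the two variables.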
When do you use mean?
When the histogram has no skewness and the data has no outliers
When do you use median?
When the histogram has skewness or the data has outliers
When do you use the mode? And what is the mode?
The mode is the distinct value with the highest frequency.
Use it when the data is categorical. If no value occurs more than once, the data set has no mode. Two modes are possible when two values tie for the highest frequency and every other value occurs less often.
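The three measures of centre can be compared with Python's statistics module. The data below are made up, with one deliberate outlier to show why the median is preferred for skewed data.

```python
import statistics

incomes = [30, 32, 35, 35, 38, 40, 250]  # hypothetical data; 250 is an outlier

print(statistics.mean(incomes))       # about 65.7, pulled upward by the outlier
print(statistics.median(incomes))     # 35, unaffected by the outlier
print(statistics.multimode(incomes))  # [35]

# Two values tie for the highest frequency, so there are two modes.
pets = ["cat", "dog", "cat", "dog", "fish"]
print(statistics.multimode(pets))     # ['cat', 'dog']
```

Note that statistics.multimode returns every value when nothing repeats, whereas these notes treat that case as having no mode.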
“The Spread”
variability of a distribution
How do you calculate range?
Range = max(x1, ..., xn) - min(x1, ..., xn)
aka range = biggest value - smallest value
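A one-line check of the range formula, using made-up values:

```python
data = [4, 9, 1, 7, 12]  # hypothetical observations

data_range = max(data) - min(data)  # biggest value minus smallest value
print(data_range)  # 11
```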
Lower Quartile
Q1 is the 25th percentile that separates the bottom 25% of the data from the top 75%
Middle Quartile
Q2 is the 50th percentile that splits the data in half (median)
Upper Quartile
Q3 is the 75th percentile that separates the bottom 75% of the data from the top 25%
Inter-Quartile Range (IQR)
The IQR is the difference between the upper quartile and the lower quartile
IQR = Q3 - Q1
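A sketch of computing the quartiles and the IQR with statistics.quantiles. The data are hypothetical, and the function's default "exclusive" method is only one of several quartile conventions, so a textbook or calculator may give slightly different values for other data sets.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]  # hypothetical observations

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(q1, q2, q3)  # 4.0 8.0 12.0
print(iqr)         # 8.0
```

Here Q2 equals the median of the data, as expected.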
How do you know if a value is an outlier?
If the value falls outside the interval [lower limit, upper limit]
Upper limit = Q3 + 1.5 × IQR
Lower limit = Q1 - 1.5 × IQR
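The outlier rule can be sketched as follows. find_outliers is a hypothetical helper, the data are made up, and the quartiles come from statistics.quantiles with its default method, which is an assumption since other quartile conventions exist.

```python
import statistics

def find_outliers(data):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    return [x for x in data if x < lower_limit or x > upper_limit]

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]  # hypothetical observations
print(find_outliers(data))  # [40]
```

For this data Q1 = 4 and Q3 = 12, so the limits are -8 and 24, and only 40 falls outside them.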
Random Variables
A measurable characteristic that varies from one member to another and whose observed value depends on chance