Statistics
The study of the collection, analysis, and interpretation of data
Econometrics
A branch of economics that uses statistics to analyze economic problems
A/B Testing
A way to compare two versions of something to find out which version performs better
Sample
Subset of a larger population
Inferential Statistics
Allow data professionals to make inferences about a dataset based on a sample of the data (i.e., use existing data to predict outcomes). Example: how the next 99k users will behave based on how the first 1k users behaved.
A/B testing cannot predict outcomes with 100% certainty; there will always be a ___
confidence interval
Confidence Interval
A range of values that describes the uncertainty surrounding an estimate
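A minimal sketch of how a confidence interval for a mean could be computed. The session-length values are made up for illustration, and the normal (z) approximation is an assumption; for small samples a t-based interval is usually preferred.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # z-value that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean - margin, mean + margin


# Hypothetical session lengths in minutes (illustrative data only)
sample = [12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5, 12.4, 10.9]
low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A higher confidence level (e.g., 0.99) gives a wider interval for the same sample.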
Statistical Significance
The claim that the results of a test or experiment are not explainable by chance alone
A/B Testing Steps
-Analyze a small group of users
-Decide on the sample size
-Determine the statistical significance
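One common way to carry out the last step is a two-proportion z-test on conversion rates. The sketch below is an assumption about how significance might be checked, not the only method; the conversion counts and the 0.05 threshold are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both versions perform the same
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: version A converts 100 of 1,000 users, version B 130 of 1,000
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
significant = p_value < 0.05  # 0.05 is a conventional, assumed threshold
print(f"z = {z:.2f}, p = {p_value:.4f}, significant: {significant}")
```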
Descriptive Statistics
Describe or summarize the main features of a dataset
Useful because they let you understand a large amount of data quickly.
Example: You have the heights of 10M people.
If you summarize the data (i.e., find the mean or median height) you have useful knowledge about the data. Better than staring at 10M rows of data.
2 Common Types of Descriptive Statistics
-Visuals like graphs and tables
-Summary Stats: let you summarize your data using a single number (e.g., mean or average value)
2 Main Types of Summary Stats
1) Measures of Central Tendency: Describe the center of your dataset (e.g., the mean)
2) Measures of Dispersion: Describe the spread of your dataset or the amount of variation in your data points (e.g., standard deviation: a measure of how dispersed the data is in relation to the mean.)
Statistical Population
Every possible element that you are interested in measuring
A statistical population may refer to people, objects, or events.
For example:
Set of all residents in a country
Set of all planets in our solar system
Set of all the outcomes in 1k coin flips
So samples could be subsets of residents, planets, or coin flip outcomes
Data professionals use samples to __
Make inferences about a population
That is, they use the data they collect from a subset of a population to draw conclusions about the population as a whole.
Representative Sample
A sample that accurately reflects the population
Parameter
A characteristic of a population
Example: The average height of an entire population of giraffes is a parameter
Statistic
A characteristic of a sample
Example: The average height of a random sample of 100 giraffes is a statistic
Parameter vs Statistic
Parameter: a characteristic of a population
Example: The average height of an entire population of giraffes is a parameter
Statistic: A characteristic of a sample
Example: The average height of a random sample of 100 giraffes is a statistic
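A small simulation can make the distinction concrete. The giraffe heights below are randomly generated (an assumed mean of 4.8 m and spread of 0.4 m, not real measurements); the point is only that the parameter is computed from the whole population and the statistic from a sample.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated giraffe heights in meters
population = [random.gauss(4.8, 0.4) for _ in range(10_000)]

# Parameter: a characteristic of the entire population
parameter = statistics.mean(population)

# Statistic: the same characteristic computed on a random sample of 100
sample = random.sample(population, 100)
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.2f} m")
print(f"Sample mean (statistic):     {statistic:.2f} m")
```

The statistic will usually land close to the parameter but not exactly on it, which is why estimates come with uncertainty.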
Measures of Central Tendency
Mean: the average value
*Outliers can skew the mean. For example, with the values 5, 6, 7, 8 and an outlier of 100, the mean is 25.2, substantially different from the median (in this case 7).
Median: the middle value
Note: if there are an even number of values in your dataset, the median is the avg of the two middle values.
Example: 3, 5, 8, 10, 12, 50. The two middle values are 8 and 10. To get the median, take their avg: (8+10)/2 = 18/2 = 9. The median is 9.
Mode: The most frequently occurring value in a dataset.
A dataset could have no mode, one mode, or more than one mode.
Examples:
No mode: 1, 2, 3, 4, 5
One mode: 1, 3, 3, 5, 7
Two modes: 1, 2, 2, 4, 4
*Mode is useful for categorical data, because it shows you which category occurs most frequently
Example: Customers rate service bad, mediocre, good, or great.
Suppose bad is the most frequently occurring value, or mode: this indicates that improvements in service are needed.
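The numeric examples above can be checked with Python's standard `statistics` module. The `modes` helper is a hypothetical name, and it assumes the convention that a dataset has no mode when no value repeats.

```python
import statistics
from collections import Counter

# Mean vs. median with an outlier
with_outlier = [5, 6, 7, 8, 100]
mean_value = statistics.mean(with_outlier)      # 25.2
median_value = statistics.median(with_outlier)  # 7

# Median with an even number of values: average of the two middle values
even_count = [3, 5, 8, 10, 12, 50]
even_median = statistics.median(even_count)     # 9.0


def modes(data):
    """Return all modes; an empty list means no value repeats (no mode)."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 3, 4, 5]))  # []
print(modes([1, 3, 3, 5, 7]))  # [3]
print(modes([1, 2, 2, 4, 4]))  # [2, 4]
```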
When to use a mean vs median
If there are outliers: use the median
If there are no outliers: use the mean
Example: You look at 10 homes in a neighborhood. 9/10 are 100,000 and 1/10 is 1M.
The mean: 190k
The median: 100k
In this instance, the mean does not give you a good idea of the average cost of a home in this neighborhood, because only 1/10 of the homes for sale are more than 100k.
The median would be a more representative value for the average cost of a home for sale in this neighborhood.
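The home-price example can be reproduced in a few lines; the variable names are illustrative.

```python
import statistics

# Nine homes at 100,000 and one outlier at 1,000,000
prices = [100_000] * 9 + [1_000_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(f"Mean:   {mean_price:,.0f}")    # Mean:   190,000
print(f"Median: {median_price:,.0f}")  # Median: 100,000
```

The single 1M home pulls the mean up by 90k, while the median stays at the price of a typical home.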
When should you use mode over median or mean?
When working with categorical data, because it shows you which category occurs most frequently.
Example: Customers rate service bad, mediocre, good, or great.
Suppose bad is the most frequently occurring value, or mode: this indicates that improvements in service are needed.
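For categorical data, counting categories is enough to find the mode. The ratings list below is invented for illustration.

```python
from collections import Counter

# Hypothetical customer ratings (illustrative data only)
ratings = ["bad", "good", "bad", "mediocre", "great", "bad", "good", "bad"]

counts = Counter(ratings)
# most_common(1) returns a list holding the (category, count) pair with the top count
mode_rating, mode_count = counts.most_common(1)[0]

print(f"Mode: {mode_rating} ({mode_count} of {len(ratings)} ratings)")
# Mode: bad (4 of 8 ratings)
```

A mean or median is not defined for these labels, which is why the mode is the measure to reach for here.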
What to look for in a new dataset
-Measures of central tendency (center): mean, median, mode
-Measures of dispersion (spread): standard deviation, range
Example:
The following sets have similar central tendencies (the means are about 26.7, 30, and 30) BUT their measures of dispersion (spread) are markedly different.
Set 1: 25, 30, 25
Set 2: 10, 25, 55
Set 3: 5, 10, 75
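A quick sketch that puts center and spread side by side for the three sets above. Using the sample standard deviation (`statistics.stdev`) is an assumption; the notes do not say which version is intended.

```python
import statistics

sets = {
    "Set 1": [25, 30, 25],
    "Set 2": [10, 25, 55],
    "Set 3": [5, 10, 75],
}

summary = {}
for name, values in sets.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }
    s = summary[name]
    print(f"{name}: mean={s['mean']:.1f}, range={s['range']}, stdev={s['stdev']:.1f}")
```

The range and standard deviation grow from Set 1 to Set 3 even though the means stay close together.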
Range
A measure of dispersion.
The difference between the largest and smallest value in a dataset
The range is a useful metric because it’s easy to calculate, and it gives you a very quick understanding of the overall spread of your dataset.
Example 1: Daily temperatures in a small town over the past week: 77, 74, 72, 71, 67, 69, 72
The highest temp: 77
The lowest temp: 67
Range: 77-67=10
Example 2: Imagine you’re a biology teacher and you have data on scores for the final exam. The highest score is 99/100, or 99%. The lowest score is 62/100, or 62%. To calculate the range, subtract the lowest score from the highest score.
99 - 62 = 37
The range is 37 percentage points.
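Both range examples can be checked with a one-line helper. `value_range` is a hypothetical name; for the exam example only the highest and lowest scores are given, so only those two values are used.

```python
def value_range(values):
    """Range: the difference between the largest and smallest value."""
    return max(values) - min(values)


# Example 1: daily temperatures over the past week
temps = [77, 74, 72, 71, 67, 69, 72]
temp_range = value_range(temps)
print(temp_range)  # 10

# Example 2: only the highest and lowest exam scores are known
score_range = value_range([99, 62])
print(score_range)  # 37
```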
Standard Deviation
A measure of dispersion
Measures how spread out your values are from the mean of your dataset.
It calculates the typical distance of a data point from the mean.
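A minimal sketch of the calculation, using the week of temperatures from the Range card. It assumes the sample version of the formula (dividing by n - 1); the population version divides by n instead.

```python
import math
import statistics


def sample_std_dev(values):
    """Sample standard deviation, computed step by step."""
    n = len(values)
    mean = sum(values) / n
    # Squared distance of each data point from the mean
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Divide by n - 1 (sample version), then take the square root
    return math.sqrt(sum(squared_diffs) / (n - 1))


temps = [77, 74, 72, 71, 67, 69, 72]
by_hand = sample_std_dev(temps)
from_library = statistics.stdev(temps)
print(f"By hand: {by_hand:.3f}, statistics.stdev: {from_library:.3f}")
```

The step-by-step result should match `statistics.stdev`, which implements the same sample formula.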