Stats Flashcards

(222 cards)

1
Q

Statistics

A

The study of the collection, analysis, and interpretation of data

2
Q

Econometrics

A

A branch of economics that uses statistics to analyze economic problems

3
Q

A/B Testing

A

A way to compare two versions of something to find out which version performs better

4
Q

Sample

A

Subset of a larger population

5
Q

Inferential Statistics

A

Allow data professionals to make inferences about a dataset based on a sample of the data (i.e., use existing data to predict outcomes, e.g., how the next 99k users will behave based on how the first 1k users behaved.)

6
Q

A/B testing cannot predict outcomes with 100% certainty, so results are reported with a ___

A

confidence interval

7
Q

Confidence Interval

A

A range of values that describes the uncertainty surrounding an estimate

8
Q

Statistical Significance

A

The claim that the results of a test or experiment are not explainable by chance alone

9
Q

A/B Testing Steps

A

-Analyze a small group of users
-Decide on the sample size
-Determine the statistical significance

10
Q

Descriptive Statistics

A

Describe or summarize the main features of a dataset
Useful because they let you understand a large amount of data quickly.

Example: You have the heights of 10M people.
If you summarize the data (i.e., find the mean or median height), you have useful knowledge about the data. Better than staring at 10M rows of data.

11
Q

2 Common Types of Descriptive Statistics

A

-Visuals like graphs and tables
-Summary stats: let you summarize your data using a single number (e.g., the mean or average value)

12
Q

2 Main Types of Summary Stats

A

1) Measures of Central Tendency: Describe the center of your dataset (e.g., the mean)

2) Measures of Dispersion: Describe the spread of your dataset, or the amount of variation in your data points (e.g., standard deviation: a measure of how dispersed the data is in relation to the mean).

13
Q

Statistical Population

A

Every possible element that you are interested in measuring

A statistical population may refer to people, objects, or events.

For example:
Set of all residents in a country
Set of all planets in our solar system
Set of all the outcomes of 1k coin flips

So samples could be residents, planets, or coin flip outcomes

14
Q

Data professionals use samples to __

A

Make inferences about a population

That is, they use the data they collect from a subset of a population to draw conclusions about the population as a whole.

15
Q

Representative Sample

A

A sample that accurately reflects the population

16
Q

Parameter

A

A characteristic of a population

Example: The average height of an entire population of giraffes is a parameter

17
Q

Statistic

A

A characteristic of a sample

Example: The average height of a random sample of 100 giraffes is a statistic

18
Q

Parameter vs Statistic

A

Parameter: a characteristic of a population
Example: The average height of an entire population of giraffes is a parameter

Statistic: A characteristic of a sample
Example: The average height of a random sample of 100 giraffes is a statistic

19
Q

Measures of Central Tendency

A

Mean: the average value

*Outliers can skew the mean (e.g., if you have values like 5, 6, 7, 8 and then an outlier like 100, the outlier can throw the mean off and make it substantially different from the median, in this case 7).

Median: the middle value
Note: if there are an even number of values in your dataset, the median is the avg of the two middle values.
Example: 3, 5, 8, 10, 12, 50. The two middle values: 8 and 10. To get the median, take their average: (8 + 10) / 2 = 18 / 2 = 9. The median is 9.

Mode: The most frequently occurring value in a dataset.
A dataset could have no mode, one mode, or more than one mode.
Examples:
No mode: 1, 2, 3, 4, 5
One mode: 1, 3, 3, 5, 7
Two modes: 1, 2, 2, 4, 4

*Mode is useful for categorical data, because it shows you which category occurs most frequently
Example: Customers rate service bad, mediocre, good, or great.
bad is the most frequently occurring value or mode, indicating that improvements in service are needed.
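As a quick sketch (not part of the card itself), Python's standard-library statistics module can compute all three measures; the example dataset is the "one mode" list from above:

```python
import statistics

data = [1, 3, 3, 5, 7]

mean = statistics.mean(data)      # (1 + 3 + 3 + 5 + 7) / 5 = 3.8
median = statistics.median(data)  # middle value of the sorted data: 3
mode = statistics.mode(data)      # most frequently occurring value: 3

print(mean, median, mode)
```

Note that statistics.mode raises an error on some Python versions if there is a tie; statistics.multimode handles datasets with more than one mode.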

20
Q

When to use a mean vs median

A

If there are outliers: use the median

If there are no outliers: use the mean

Example: You look at 10 homes in a neighborhood. 9 of the 10 cost $100,000 and 1 costs $1M.
The mean: 190k
The median: 100k

In this instance, the mean does not give you a good idea of the average cost of a home in this neighborhood, because only 1/10 of the homes for sale are more than 100k.
The median would be a more representative value for the average cost of a home for sale in this neighborhood.
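A quick way to see this effect in Python (a sketch; the variable names are my own):

```python
import statistics

# nine $100k homes and one $1M outlier
prices = [100_000] * 9 + [1_000_000]

mean_price = statistics.mean(prices)      # 190000: dragged up by the outlier
median_price = statistics.median(prices)  # 100000: more representative of a typical home

print(mean_price, median_price)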

21
Q

When should you use mode over median or mean?

A

When working with categorical data, because it shows you which category occurs most frequently.

Example: Customers rate service bad, mediocre, good, or great.
bad is the most frequently occurring value or mode, indicating that improvements in service are needed.

22
Q

What to look for in a new dataset

A

-Measures of central tendency (center): mean, median, mode
-Measures of dispersion (spread): standard deviation, range

Example:
The following sets have similar central tendencies (the means are about 26.7, 30, and 30, respectively) BUT the measures of dispersion/spread are markedly different.

Set 1: 25, 30, 25
Set 2: 10, 25, 55
Set 3: 5, 10, 75

23
Q

Range

A

A measure of dispersion.

The difference between the largest and smallest value in a dataset

The range is a useful metric because it’s easy to calculate, and it gives you a very quick understanding of the overall spread of your dataset.

Example 1: Daily temperatures in a small town over the past week: 77, 74, 72, 71, 67, 69, 72

The highest temp: 77
The lowest temp: 67

Range: 77-67=10

Example 2: Imagine you’re a biology teacher and you have data on scores for the final exam. The highest score is 99/100, or 99%. The lowest score is 62/100, or 62%. To calculate the range, subtract the lowest score from the highest score.

99 - 62 = 37

The range is 37 percentage points.
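Both range examples can be checked in a couple of lines of Python (a sketch; the variable names are my own):

```python
temps = [77, 74, 72, 71, 67, 69, 72]   # daily temperatures over the past week
temp_range = max(temps) - min(temps)   # 77 - 67 = 10

scores = [99, 62]                      # highest and lowest exam scores
score_range = max(scores) - min(scores)  # 99 - 62 = 37

print(temp_range, score_range)
```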

24
Q

Standard Deviation

A

A measure of dispersion

Measures how spread out your values are from the mean of your dataset.
It calculates the typical distance of a data point from the mean.

25
Q

Variance

A

The average of the squared differences of each data point from the mean. (It is the square of the standard deviation.)
26
Q

How to calculate standard deviation

A

The formula for sample standard deviation:

s = √( Σ(x − x̄)² / (n − 1) )

(For a population, divide by N instead of n − 1: σ = √( Σ(x − μ)² / N ).)

Steps:
Step 1: Calculate the mean of the data (x̄).
Step 2: Subtract the mean from each data point x. These differences are called deviations. Data points below the mean have negative deviations, and data points above the mean have positive deviations.
Step 3: Square each deviation to make it positive.
Step 4: Add the squared deviations together.
Step 5: Divide the sum by n − 1 for a sample (or by N for a population). The result is called the variance.
Step 6: Take the square root of the variance to get the standard deviation.

Example 1: To better understand the different parts of the formula, calculate the sample standard deviation of a small dataset: 2, 3, 10.

1. Calculate the mean, or average, of your data values: (2 + 3 + 10) ÷ 3 = 15 ÷ 3 = 5
2. Subtract the mean from each value: 2 − 5 = −3; 3 − 5 = −2; 10 − 5 = 5
3. Square each result: 9, 4, 25
4. Add up the squared results and divide this sum by one less than the number of data values. This is the variance: (9 + 4 + 25) ÷ (3 − 1) = 38 ÷ 2 = 19
5. Finally, find the square root of the variance: √19 ≈ 4.36

The sample standard deviation is 4.36.

Example 2: Meteorologists use standard deviation in weather forecasting to understand how much variation exists in daily temperatures in different places and to make more accurate predictions.

City A: mean temp 66 degrees, standard deviation 3 degrees
City B: mean temp 64 degrees, standard deviation 16 degrees

Because the standard deviation is higher in City B, there is more variation in daily temperature there than in City A, where the weather is more consistent. If the meteorologist relied only on the mean in City B, the forecast could be off by as much as 16 degrees, which would, understandably, result in many grumpy residents.

Knowing the standard deviation gives the meteorologist in City B a useful measure of variance to consider and a level of confidence about the prediction. Regardless, a higher standard deviation does make the weather harder to predict accurately.
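The five steps for the dataset 2, 3, 10 can be verified in Python; this sketch mirrors the card's arithmetic step by step:

```python
import math

data = [2, 3, 10]
n = len(data)

mean = sum(data) / n                   # step 1: (2 + 3 + 10) / 3 = 5.0
deviations = [x - mean for x in data]  # step 2: [-3.0, -2.0, 5.0]
squared = [d ** 2 for d in deviations] # step 3: [9.0, 4.0, 25.0]
variance = sum(squared) / (n - 1)      # step 4: sample variance = 19.0
std = math.sqrt(variance)              # step 5: sqrt(19) ≈ 4.36

print(round(std, 2))
```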
27
Q

What are a few examples of data professionals using standard deviation to measure variation?

A

-Ad revenues
-Stock prices
-Employee salaries
-Weather forecasts

Weather example: Meteorologists use standard deviation in weather forecasting to understand how much variation exists in daily temperatures in different places and to make more accurate predictions.

City A: mean temp 66 degrees, standard deviation 3 degrees
City B: mean temp 64 degrees, standard deviation 16 degrees

Because the standard deviation is higher in City B, there is more variation in daily temperature there than in City A, where the weather is more consistent. If the meteorologist relied only on the mean in City B, the forecast could be off by as much as 16 degrees, which would, understandably, result in many grumpy residents.

Knowing the standard deviation gives the meteorologist in City B a useful measure of variance to consider and a level of confidence about the prediction. Regardless, a higher standard deviation does make the weather harder to predict accurately.
28
Q

Example of when knowing standard deviation is helpful

A

Real estate prices. Imagine you're a data professional working for a real estate company. The real estate agents on your team like to inform their clients about the variation in rental prices in different residential areas. Part of your job is calculating the standard deviation of monthly rental prices for apartments in specific neighborhoods and sharing this information with your team.

Say you have sample data on monthly rental prices for one-bedroom apartments in two different neighborhoods, Emerald Woods and Rock Park, and you calculate the mean and standard deviation for each dataset.

Emerald Woods monthly rents (apartments #1–#5): $900, $950, $1,000, $1,050, $1,100
Mean: $1,000. Standard deviation: $79.05

Rock Park monthly rents (apartments #1–#5): $500, $650, $1,000, $1,350, $1,500
Mean: $1,000. Standard deviation: $431.56

Both neighborhoods have the same mean rental price of $1,000 per month. However, the standard deviation for rental prices in Rock Park ($431.56) is much higher than for Emerald Woods ($79.05): there is a lot more variation in rental prices in Rock Park.

This is useful information for your agents. For example, they can tell clients that it may be easier to find an affordable apartment in Rock Park that is far below the mean of $1,000. Standard deviation helps you quickly understand the variation in prices in any given neighborhood.
29
Q

Measures of Position / Most Common Measures of Position

A

Measures of position: determine the position of a value in relation to other values in a dataset.

Most common measures of position: percentiles, quartiles, interquartile range, five-number summary.
30
Q

Percentile

A

The value below which a percentage of data falls. Percentiles show the relative position or rank of a particular value in a dataset. A percentile is a measure of position.

Example: Many universities require students to take standardized tests (e.g., the SAT and ACT in the US). When students receive their test score, they usually also receive a corresponding percentile. If a test score falls in the 99th percentile, it's higher than 99% of test scores. If it falls in the 77th percentile, it's higher than 77% of test scores.

Percentiles are useful for comparing values.
31
Q

Quartile

A

Divides the values in a dataset into four equal parts. Each quarter contains 25% of the data in your dataset.

Q1 (25th percentile, lower quartile): 25% of the data is below Q1; 75% is above it.
Q2 (50th percentile): Q2 is the median. 50% of the data is below Q2; 50% is above it.
Q3 (75th percentile, upper quartile): 75% of the data is below Q3; 25% is above it.
32
Q

Example of how to calculate quartiles for a set

A

Example: goals scored by eight players (#7, #3, #8, #1, #2, #6, #4, #5), sorted: 11, 12, 14, 18, 22, 23, 27, 33

1. Find the median of your full dataset. There is an even number of values, so find the middle two values and take their mean: (18 + 22) / 2 = 20. Q2 = 20.
2. Find the median of the lower half of your dataset: 11, 12, 14, 18. Again, there is an even number of values, so take the mean of the middle two: (12 + 14) / 2 = 26 / 2 = 13. Q1 = 13.
3. Find the median of the upper half of your dataset: 22, 23, 27, 33. Take the mean of the middle two: (23 + 27) / 2 = 50 / 2 = 25. Q3 = 25.

This gives you a clear idea of player performance:
Q1 = 13 (the lower 25% of players scored 13 goals or fewer)
Q2 = 20
Q3 = 25 (the upper 25% scored 25 goals or more)
33
Q

Interquartile range (IQR)

A

The distance between the first quartile (Q1) and the third quartile (Q3): the middle 50% of your data, or the distance between the 25th and 75th percentiles. Technically, the IQR is a measure of dispersion, because it measures the spread of the middle 50% of your data.

IQR = Q3 − Q1

Example: goals scored by eight players, sorted: 11, 12, 14, 18, 22, 23, 27, 33

Q2 = 20 (median of the full dataset: (18 + 22) / 2)
Q1 = 13 (median of the lower half, 11, 12, 14, 18: (12 + 14) / 2)
Q3 = 25 (median of the upper half, 22, 23, 27, 33: (23 + 27) / 2)

IQR = 25 − 13 = 12
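The median-of-halves method from this card can be sketched in Python; note that NumPy's np.percentile uses linear interpolation by default, so it can return slightly different quartiles than this method:

```python
import statistics

goals = sorted([11, 12, 14, 18, 22, 23, 27, 33])
mid = len(goals) // 2

q1 = statistics.median(goals[:mid])  # median of the lower half: 13
q2 = statistics.median(goals)        # median of the full dataset: 20
q3 = statistics.median(goals[mid:])  # median of the upper half: 25
iqr = q3 - q1                        # 25 - 13 = 12

print(q1, q2, q3, iqr)
```

This even-count split matches the card's example exactly; an odd-count dataset needs a convention for whether the overall median joins the halves.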
34
Q

Five-Number Summary

A

The minimum, the first quartile (Q1), the median (second quartile, Q2), the third quartile (Q3), and the maximum.

Useful because it gives you an idea of the overall distribution of your data, from the extreme values to the center.

Example: goals scored by eight players, sorted: 11, 12, 14, 18, 22, 23, 27, 33

The minimum = 11
The first quartile (Q1) = 13 (median of the lower half: (12 + 14) / 2)
The median (Q2) = 20 (median of the full dataset: (18 + 22) / 2)
The third quartile (Q3) = 25 (median of the upper half: (23 + 27) / 2)
The maximum = 33

(IQR = 25 − 13 = 12)
35
Q

Box Plot

A

A visualization of the five-number summary: the box spans Q1 to Q3 (the IQR) with a line at the median (Q2), and whiskers extend from the box out to the minimum and maximum.

  Min |---------[ Q1 ----- Median (Q2) ----- Q3 ]---------| Max
      (whisker)  |<-------------- IQR -------------->|  (whisker)

Example: Q1 = 13, Q2/median = 20, Q3 = 25, IQR = 25 − 13 = 12
36
Q

Measures of position can be used by data professionals to better understand a number of things. Provide a few examples.

A

Measures of position, which determine the position of a value in relation to other values in a dataset, can be used to better understand:
-public health data, such as life expectancy
-macroeconomic data, such as household income
-business data, such as product sales

*REMINDER: the most common measures of position are percentiles, quartiles, the interquartile range, and the five-number summary.
37
Q

Percentile vs Percentage

A

Percentiles and percentages are distinct concepts. For example, say you score 90/100, or 90%, on a test. This doesn’t necessarily mean your score of 90% is in the 90th percentile. Percentile depends on the relative performance of all test takers: if half of all test takers score above 90%, then a score of 90% is in the 50th percentile.

A percentile is the value below which a percentage of data falls. Percentiles divide your data into 100 equal parts and give the relative position or rank of a particular value in a dataset. For example, percentiles are commonly used to rank test scores on school exams. If a score falls in the 99th percentile, it is higher than 99% of all test scores. If a score falls in the 75th percentile, it is higher than 75% of all test scores. If a score falls in the 50th percentile, it is higher than half, or 50%, of all test scores.

A percentile is a measure of position.
38
Q

Use case for percentiles?

A

Percentiles are useful for comparing values and putting data in context.

For example, imagine you want to buy a new car: a midsize sedan with great fuel economy. In the United States, fuel economy is measured in miles per gallon of fuel, or mpg. The sedan you’re considering gets 23 mpg. Is that good or bad? Without a basis for comparison, it’s hard to know. However, if you know that 23 mpg is in the 25th percentile of all midsize sedans, you have a much clearer idea of its relative performance: 75% of all midsize sedans have a higher mpg than the car you’re thinking about buying.
39
Q

How to determine percentile using Python (NumPy)

A

Example 1: Find the 40th percentile of a data array.

import numpy as np
data = np.array([10, 20, 30, 40, 50])
np.percentile(data, 40)

Example 2: Find the 25th, 50th, and 75th percentiles of the data array.

np.percentile(data, [25, 50, 75])

Note on the axis parameter:
axis=0: percentiles down columns
axis=1: percentiles across rows

So for

arr = np.array([[10, 20, 30],
                [40, 50, 60]])
np.percentile(arr, 50, axis=0)

the result gives you the median (50th percentile) of each column: [25., 35., 45.]
40
Q

Measures of Dispersion

A

Range: the difference between the largest and smallest value in a dataset.
Variance: the average of the squared differences of each data point from the mean.
Standard deviation: the typical distance of a data point from the mean of your dataset (the square root of the variance).
Interquartile range (IQR): the difference between the third quartile (Q3) and the first quartile (Q1); it indicates the spread of the middle half, or middle 50%, of your data.
41
Q

Measures of Position

A

Percentile: the value below which a percentage of data falls; divides the values in a dataset into 100 equal parts.
Quartile: divides the values in a dataset into 4 equal parts.
42
Q

Descriptive Stats

A

Measures of central tendency: describe the center of the dataset.
Measures of dispersion: describe the spread of your dataset.
Measures of position: show the relative location of your data values.
43
Q

EDA Steps

A

1) Discovering: the goal is to understand the context of the data. This is often done by discussing the data with project stakeholders and by reading documentation about the dataset and the data collection process.
2) Structuring
3) Cleaning: deal with issues like missing data and incorrect values. A common step after cleaning: compute descriptive stats to summarize the dataset.
4) Joining
5) Validating
6) Presenting
44
Q

Multiplication Rule of Probability

A

First, determine whether the events are independent or dependent.

Independent events: one event's outcome has no effect on the other event's outcome.
Dependent events: the second event's probability depends on the outcome of the first event.

Independent example 1: What's the probability of rolling snake eyes (both dice showing 1)? Each roll is an independent event, because neither roll affects the other.

P(A and B) = P(A) × P(B)

The probability of A and B occurring together (i.e., both dice showing 1) = the probability of event A × the probability of event B:
1/6 × 1/6 = 1/36

So, out of all 36 possible rolls of two dice, 1 is snake eyes.

Independent example 2: What's the probability of flipping heads 3 times in a row?
1/2 × 1/2 × 1/2 = 1/8
So, out of the 8 possible outcomes for three coin flips, 1 is all heads.

Dependent example (the second event's probability depends on the outcome of the first):

P(A and B) = P(A) × P(B|A)

The probability that events A and B both happen = the probability that A happens × the probability that B happens given that A happened.

What's the probability that you'll draw an Ace, hold on to it, and then draw a King?
Ace: 4/52 (4 of the 52 cards are Aces)
King: 4/51 (51 cards remain after removing the Ace, and 4 of them are Kings)
4/52 × 4/51 = 16/2652 ≈ 1/166 (about a one in 166 chance)
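The card's calculations can be reproduced exactly with Python's fractions module (a sketch; the variable names are my own):

```python
from fractions import Fraction

# Independent events: snake eyes = P(1 on die A) * P(1 on die B)
p_snake_eyes = Fraction(1, 6) * Fraction(1, 6)       # 1/36

# Independent events: three heads in a row
p_three_heads = Fraction(1, 2) ** 3                  # 1/8

# Dependent events: draw an Ace, keep it, then draw a King
p_ace_then_king = Fraction(4, 52) * Fraction(4, 51)  # 16/2652 = 4/663

print(p_snake_eyes, p_three_heads, p_ace_then_king)
```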
45
Q

Addition Rule of Probability

A

Use the addition rule of probability for situations like the following:
-Example 1: What's the chance of drawing a Heart OR a Face card?
-Example 2: If some students play soccer, some play tennis, and some play both, what's the probability that a randomly selected student plays soccer, tennis, or both?

Mutually exclusive events: the events cannot both occur (e.g., you roll a 2 or a 5 in a single dice roll).

P(A or B) = P(A) + P(B)

The probability of A or B occurring is the probability of A occurring + the probability of B occurring.

Mutually exclusive example: What's the probability of rolling a 2 or a 5?
Probability of rolling a 2: 1/6. Probability of rolling a 5: 1/6.
1/6 + 1/6 = 2/6

Example 1 of events that are NOT mutually exclusive: What's the likelihood of drawing a Heart or a Face card? (The Queen of Hearts is a Heart AND a Face card.)

P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A and B)

The probability of A or B = the probability of A + the probability of B − the probability of A and B occurring together.

Hearts: 13/52 (13 Hearts in a deck of cards)
Face cards: 12/52 (12 Face cards)
Cards that are both: 3/52 (we must subtract these 3, as otherwise we'd count the Jack, Queen, and King of Hearts twice)

13/52 + 12/52 − 3/52 = 22/52

There's a 22/52 probability that you'll draw a Heart or a Face card.

Example 2 of events that are NOT mutually exclusive: 50% of the student body plays soccer, 20% plays tennis, and 10% plays both. What's the probability that a random student plays soccer, tennis, or both?
50% + 20% − 10% = 60% probability that the selected student plays soccer, tennis, or both.
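Likewise, the addition-rule examples can be checked with the fractions module (a sketch; the variable names are my own):

```python
from fractions import Fraction

# Mutually exclusive: roll a 2 OR a 5
p_2_or_5 = Fraction(1, 6) + Fraction(1, 6)  # 2/6 = 1/3

# Not mutually exclusive: Heart OR Face card (subtract the overlap)
p_heart_or_face = Fraction(13, 52) + Fraction(12, 52) - Fraction(3, 52)  # 22/52 = 11/26

# Not mutually exclusive: soccer, tennis, or both
p_soccer_or_tennis = 0.50 + 0.20 - 0.10  # 60%

print(p_2_or_5, p_heart_or_face, p_soccer_or_tennis)
```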
46
Q

What Python packages do you want to load for stats?

A

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
47
Q

How do you load a CSV file?

A

Example 1:

import pandas as pd
education_districtwise = pd.read_csv(...)

Example 2:

import pandas as pd
epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col=0)

Note: the index_col parameter can be set to 0 to read in the first column as an index (and to avoid "Unnamed: 0" appearing as a column in the resulting DataFrame).
48
Q

What does describe() give you?

A

Data professionals use the describe() function as a convenient way to calculate many key stats all at once.

For a numeric column, describe() gives you the following output:
count: number of non-NA/null observations
mean: the arithmetic average
std: the standard deviation
min: the smallest (minimum) value
25%: the first quartile (25th percentile)
50%: the median (50th percentile)
75%: the third quartile (75th percentile)
max: the largest (maximum) value

Note: describe() excludes missing values (NaN) in the dataset from consideration. You may notice that the count, or number of observations, for OVERALL_LI (634) is fewer than the number of rows in the dataset (680). Dealing with missing values is a complex issue outside the scope of this course.

You can also use the describe() function for a column with categorical data, like the STATNAME column. For a categorical column, describe() gives you the following output:
count: number of non-NA/null observations
unique: number of unique values
top: the most common value (the mode)
freq: the frequency of the most common value

Example:

education_districtwise['STATNAME'].describe()

Output:
count        680
unique        36
top      STATE21
freq          75
Name: STATNAME, dtype: object
49
Q

How do you calculate standard deviation for a column?

A

Example (epa_data is the DataFrame, aqi is the column):

np.std(epa_data['aqi'], ddof=1)

Note: ddof stands for delta degrees of freedom. When NumPy computes standard deviation, it uses this formula:

std = √( Σ(xᵢ − x̄)² / (N − ddof) )

N = number of observations
x̄ = mean
ddof = how much you subtract from N in the denominator

Why do you need to subtract 1? When you estimate the mean from the same data you're measuring the variability of, you "use up" one degree of freedom. That makes the raw variance too small unless you correct for it.

When to use ddof=1:
-Your data is a sample.
-You're doing inference, comparison, or modeling as opposed to pure descriptive statistics. (That is, you're asking "what does this dataset tell us about something bigger?" and not "what does this exact dataset look like?" The latter is pure descriptive statistics.)
-You're following stats conventions (e.g., coursework, reports, analysis).
-You want results comparable to pandas, R, or textbooks.

When to use ddof=0:
-Your data represents the entire population.
-You're doing pure descriptive statistics.
-You care about the exact dispersion of this dataset, not a generalization (i.e., you are looking at this specific dataset and not trying to generalize to next year, other cities, etc.).
-You're matching NumPy defaults or ML preprocessing pipelines.

NOTE: pandas defaults to ddof=1; NumPy defaults to ddof=0. So np.std(data) and data.std() are NOT equivalent, while np.std(data, ddof=1) and data.std() are equivalent.
50
Q

What is ddof?

A

ddof: delta degrees of freedom. When NumPy computes standard deviation, it uses this formula:

std = √( Σ(xᵢ − x̄)² / (N − ddof) )

N = number of observations
x̄ = mean
ddof = how much you subtract from N in the denominator

Why do you need to subtract 1? When you estimate the mean from the same data you're measuring the variability of, you "use up" one degree of freedom. That makes the raw variance too small unless you correct for it.

When to use ddof=1:
-Your data is a sample.
-You're doing inference, comparison, or modeling as opposed to pure descriptive statistics. (That is, you're asking "what does this dataset tell us about something bigger?" and not "what does this exact dataset look like?" The latter is pure descriptive statistics.)
-You're following stats conventions (e.g., coursework, reports, analysis).
-You want results comparable to pandas, R, or textbooks.

When to use ddof=0:
-Your data represents the entire population.
-You're doing pure descriptive statistics.
-You care about the exact dispersion of this dataset, not a generalization (i.e., you are looking at this specific dataset and not trying to generalize to next year, other cities, etc.).
-You're matching NumPy defaults or ML preprocessing pipelines.

NOTE: pandas defaults to ddof=1; NumPy defaults to ddof=0. So np.std(data) and data.std() are NOT equivalent, while np.std(data, ddof=1) and data.std() are equivalent.
51
Q

Probability

A

The branch of math that deals with measuring and quantifying uncertainty (i.e., the chance of something happening).
52
Q

Objective probability

A

Based on stats, experiments, and mathematical measurements.

2 types:
Classical probability: based on formal reasoning about events with equally likely outcomes.
Empirical probability: based on experimental or historical data.
53
Q

Subjective probability

A

Based on personal feelings, experience, or judgment.
54
Q

Classical Probability

A

One of the two types of objective probability (the other is empirical probability).

Classical probability is based on formal reasoning about events with equally likely outcomes.

Calculation: classical probability = number of desired outcomes / total number of possible outcomes

Example: drawing a specific card, like the Ace of Hearts, from a deck has a 1/52 chance.

Note: most events do not have equally likely outcomes (e.g., tomorrow's weather is not simply a 50% chance of rain; it may be an 80% chance of rain), so we need empirical probability.
55
Q

Empirical Probability

A

One of the two types of objective probability (the other is classical probability).

Empirical probability is based on experimental or historical data. It represents the likelihood of an event occurring based on the previous results of an experiment or past events.

Calculation: empirical probability = number of times a specific event occurs / total number of events

Example: in a taste test with 100 people, you want to know the probability that a person prefers vanilla over strawberry. 80 of the 100 people prefer vanilla, so the probability that a person prefers vanilla over strawberry is 80%.
56
Q

Events in probability

A

Probability measures the likelihood of RANDOM events. The result of a random event cannot be predicted with certainty.

If the probability of an event equals 0, there is a 0% chance that the event will occur.
If the probability of an event equals 1, there is a 100% chance that the event will occur.
A probability of 0.5 means there is a 50% chance of the event occurring, and so on.

Rule of thumb:
If the probability of an event is close to 0, there is a small chance that the event will occur.
If the probability is close to 1, there is a strong chance that the event will occur.

Example: you wouldn't want to buy a stock that has a 0.05 probability of going up, but if the probability were 0.95, it would likely be a good investment. (0.05 = 5%; 0.95 = 95%.)
57
Q

Random Experiment or Statistical Experiment

A

A process whose outcome cannot be predicted with certainty.

All random/statistical experiments have 3 things in common:
1) The experiment can have more than one possible outcome.
2) You can represent each possible outcome in advance.
3) The outcome of the experiment depends on chance.
58
Classical Probability
Number of desired outcomes/total number of possible outcomes
59
Outcome
In stats, the results of a random experiment/statistical experiment is called an outcome. Example: If you roll a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6
60
Event
In stats, an event is a set of one or more outcomes. Example: If you roll a die, an event might be rolling an even number. The event of rolling an even number consists of the outcomes 2, 4, 6. The event of rolling an odd number consists of the outcomes 1, 3, 5.
61
Probability of an event
The probability that an event will occur is expressed as a number between 0 and 1. Probability can also be expressed as a percent. If the probability of an event equals 0, there is a 0% chance that the event will occur. If the probability of an event equals 1, there is a 100% chance that the event will occur. There are different degrees of probability between 0 and 1. If the probability of an event is close to zero, say 0.05 or 5%, there is a small chance that the event will occur. If the probability of an event is close to 1, say 0.95 or 95%, there is a strong chance that the event will occur. If the probability of an event equals 0.5, there is a 50% chance that the event will occur—or not occur.
62
When you say that the probability of getting heads in a coin toss is 50%, you aren't saying what exactly?
Note that when you say the probability of getting heads is 50%, you aren’t claiming that any actual sequence of coin tosses will result in exactly 50% heads. For example, if you toss a fair coin ten times, you may get 4 heads and 6 tails, or 7 heads and 3 tails. However, if you continue to toss the coin, you can expect the long-run frequency of heads to get closer and closer to 50%.
63
Probability Notation
P: indicates the probability of an event A: represents an individual event B: represents an individual event ' : means an event doesn't occur so P(A'): the probability of event A not happening P(A): the probability of event A happening Examples: -The probability of event A is written as P(A). -The probability of event B is written as P(B). -For any event A, 0 ≤ P(A) ≤ 1. In other words, the probability of any event A is always between 0 and 1. -If P(A) > P(B), then event A has a higher chance of occurring than event B. -If P(A) = P(B), then event A and event B are equally likely to occur.
64
Complement of an event
In stats, the complement of an event is an event not occurring.
65
Complement rule (for mutually exclusive events)
In stats, the complement of an event is an event not occurring. Complement rule says that the probability that event A does not occur is P(A') = 1 - P(A) This rule applies to events that are mutually exclusive Note: Mutually Exclusive Events: Two events are mutually exclusive if they cannot occur at the same time.
66
Mutually Exclusive Events:
Two events are mutually exclusive if they cannot occur at the same time. Example: You can't be in China and Argentina at the same time. You can't roll a 2 and a 6 in the same single roll of the die.
67
Addition rule (for mutually exclusive events)
P(A or B) = P(A) + P(B) Example: What's the probability of rolling either a 2 or a 4 in a single roll? P(A or B) = P(A) + P(B) P(rolling a 2 or rolling a 4) = P(rolling a 2) + P(rolling a 4) = 1/6 + 1/6 = 2/6 = 1/3, so it's about 33%
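The addition rule above can be checked in a couple of lines of Python (a minimal sketch):

```python
from fractions import Fraction

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_two = Fraction(1, 6)   # P(rolling a 2)
p_four = Fraction(1, 6)  # P(rolling a 4)

p_two_or_four = p_two + p_four
print(p_two_or_four)  # 1/3
```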
68
Independent Events
Two events are independent if the occurrence of one event does not change the probability of the other event. Example: Checking out a book from the library does not affect tomorrow's weather.
69
Multiplication Rule (for independent events)
P(A and B) = P(A) * P(B) P(first toss tails and second toss heads) = P(first toss tails) * P(2nd toss heads) = .5 *.5 = .25
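The multiplication rule above, applied to the two coin tosses, looks like this in Python (an illustrative sketch):

```python
# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_tails = 0.5  # P(first toss tails)
p_heads = 0.5  # P(second toss heads)

p_tails_then_heads = p_tails * p_heads
print(p_tails_then_heads)  # 0.25
```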
70
Conditional Probability
The probability of an event occurring given that another event has already occurred
71
Dependent Events
Two events are dependent if the occurrence of one event changes the probability of the other event Examples: If you want to travel to another country, you need to have a passport If you want to access a website, you need internet access Conditional Probability Calculation: P(A and B) = P(A) * P(B|A) OR P(B|A) = P(A and B) / P(A) P(A and B) = probability of event A and event B P(A) = probability of event A P(B|A) = probability of event B given event A Note: B|A: the vertical bar means that event "B" depends on event "A" happening Example: What's the probability of drawing an ace from a deck of cards and then another ace from that same deck? P(A): chance of getting an ace on the first draw: 4/52 P(B|A): chance of getting an ace on the second draw: 3/51 P(A and B): ace on the first draw and second draw: P(A) * P(B|A) = 4/52 * 3/51 = 1/221, or about 0.5% Example 2: What's the probability that you'll get accepted by college z and receive a scholarship from college z? acceptance rate: 10/100 applicants: 10% scholarships awarded: 2/100 accepted students: 2% 10/100 * 2/100 = 1/500 = 0.2% Business Use Case: Use conditional probability to predict how an event like an ad campaign will impact sales revenue and then share findings with stakeholders so they can make more informed business decisions.
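The two-aces calculation above can be sketched in Python with exact fractions (names are illustrative):

```python
from fractions import Fraction

# Conditional probability for dependent events: P(A and B) = P(A) * P(B|A)
p_first_ace = Fraction(4, 52)               # P(A): ace on the first draw
p_second_ace_given_first = Fraction(3, 51)  # P(B|A): ace on the second draw

p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)         # 1/221
print(float(p_both_aces))  # ~0.0045, i.e., about 0.5%
```

Note how the second factor uses 51 cards, not 52: the first draw changed the deck, which is exactly what makes the events dependent.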
72
Conditional Probability
This is for dependent events. Note: Two events are dependent if the occurrence of one event changes the probability of the other event Conditional Probability Calculation: P(A and B) = P(A) * P(B|A) OR P(B|A) = P(A and B) / P(A) P(A and B) = probability of event A and event B P(A) = probability of event A P(B|A) = probability of event B given event A Note: B|A: the vertical bar means that event "B" depends on event "A" happening Example: What's the probability of drawing an ace from a deck of cards and then another ace from that same deck? P(A): chance of getting an ace on the first draw: 4/52 P(B|A): chance of getting an ace on the second draw: 3/51 P(A and B): ace on the first draw and second draw: P(A) * P(B|A) = 4/52 * 3/51 = 1/221, or about 0.5% Example 2: What's the probability that you'll get accepted by college z and receive a scholarship from college z? acceptance rate: 10/100 applicants: 10% scholarships awarded: 2/100 accepted students: 2% 10/100 * 2/100 = 1/500 = 0.2% Business Use Case: Use conditional probability to predict how an event like an ad campaign will impact sales revenue and then share findings with stakeholders so they can make more informed business decisions. Example 3: online purchases Let’s explore another example. Imagine you are a data professional working for an online retail store. You have data that tells you 20% of the customers who visit the store’s website make a purchase of $100 or more. If a customer spends $100, they are eligible to receive a free gift card. The store randomly awards gift cards to 10% of the customers who spend at least $100. You want to calculate the probability that a customer spends $100 and receives a gift card. Receiving a gift card depends on first spending $100. So, this is a conditional probability because it deals with two dependent events. Let's apply the conditional probability formula: P(A and B) = P(A) * P(B|A) You want to calculate the probability of both event A and event B occurring.
Let’s call event A $100 and event B gift card. The probability of event A is 0.2, or 20%. The probability of event B is 0.1, or 10%. P($100 and gift card) = P($100) * P(gift card given $100) = 0.2 * 0.1 = 0.02, or 2% So, the probability of a customer spending $100 or more and receiving a free gift card is 0.2 * 0.1 = 0.02, or 2%.
73
Bayes Theorem or Bayes Rule
P(A|B) = P(B|A) * P(A) / P(B) prior probability in Bayesian statistics: the probability of an event before new data is collected posterior probability in Bayesian statistics: the updated probability of an event based on new data Example: Let's say a medical condition is related to age, you can use Bayesian probability to determine if a person has the condition based on age. prior probability: the probability of a person having the condition posterior probability: the probability of a person having the condition if they're in a certain age group The calculation: P(A|B) = P(B|A) * P(A) / P(B) In English: for any 2 events A and B, the probability of A, given B = the probability of A multiplied by the probability of B given A / probability of B Math Terms: P(A): prior probability/probability a person has the condition P(A|B) posterior probability/probability a person has the condition if they're in a certain age group Note: Sometimes statisticians use the term "likelihood" to refer to the probability of event B given event A, and the term "evidence" to refer to the probability of event B. P(B|A): likelihood P(B): evidence SO P(A|B) = P(B|A) * P(A) / P(B) posterior = likelihood * prior / evidence posterior: prob a person has the condition if they're in a certain age group likelihood: among the people with the condition, the fraction that are in the age group (e.g., 65+) prior: the prob that a person has the condition evidence: fraction of all people are 65+ Example 2: spam filter A well-known application of Bayes’s theorem in the digital world is spam filtering, or predicting whether an email is spam or not. In practice, a sophisticated spam filter deals with many different variables, including the content of the email, its title, whether it has an attachment, the domain type of the sender address (.edu or .org), and more. However, we can use a simplified version of a Bayesian spam filter for our example. 
Let’s say you want to determine the probability that an email is spam given a specific word appears in the email. For this example, let’s use the word “money.” You discover the following information: -The probability of an email being spam is 20%. -The probability that the word “money” appears in an email is 15%. -The probability that the word “money” appears in a spam email is 40%. In this example, your prior probability is the probability of an email being spam. Your posterior probability, or what you ultimately want to find out, is the probability that an email is spam given that it contains the word “money.” The new data you will use to update your prior probability is the probability that the word “money” appears in an email and the probability that the word “money” appears in a spam email. When you work with Bayes’s theorem, it’s helpful to first figure out what event A is and what event B is—this makes it easier to understand the relationship between events and use the formula. Let’s call event A a spam email and event B the appearance of the word “money” in an email. Now, you can re-write Bayes’s theorem using the word “spam” for event A and the word “money” for event B. P(A|B) = P(B|A) * P(A) / P(B) P (Spam | Money) = P(Money | Spam) * P(Spam) / P(Money) You want to find out the following: P(Spam | Money), or posterior probability: the probability that an email is spam given that the word “money” appears in the email Now, enter your data into the formula: -P(Spam), or prior probability: the probability of an email being spam = 0.2, or 20% -P(Money), or evidence: the probability that the word “money” appears in an email = 0.15, or 15% -P(Money | Spam), or likelihood: the probability that the word “money” appears in an email given that the email is spam = 0.4, or 40% -P (Spam | Money) = P(Money | Spam) * P(Spam) / P(Money) = 0.4 * 0.2 / 0.15 = 0.53333, or about 53.3%. 
So, the probability that an email is spam given that the email contains the word “money” is 53.3%.
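The spam-filter arithmetic above can be sketched in Python (a minimal example with illustrative variable names):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Spam-filter example: A = spam email, B = the word "money" appears
p_spam = 0.2              # prior: P(Spam)
p_money = 0.15            # evidence: P(Money)
p_money_given_spam = 0.4  # likelihood: P(Money | Spam)

# posterior: P(Spam | Money)
p_spam_given_money = p_money_given_spam * p_spam / p_money
print(round(p_spam_given_money, 3))  # 0.533, i.e., about 53.3%
```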
74
Prior Probability in Bayesian Statistics
The probability of an event before new data is collected
75
Posterior Probability in Bayesian Statistics
The updated probability of an event based on new data
76
Bayesian Statistics/Bayesian Inference
A powerful method for analyzing and interpreting data in modern data analytics Bayes' theorem is used in a variety of fields including, but not limited to: AI, medical testing, and financial institutions, online retailers, marketers, etc. Examples: -Financial institutions use Bayesian stats to rate the risk of lending money to borrowers or to predict the success of an investment. -online retailers use Bayesian algorithms to predict whether or not users will like certain products and services -marketers rely on Bayes' theorem for identifying positive or negative responses for customer feedback.
77
Bayes Theorem, basic vs expanded version
Basic: P(A|B) = P(B|A) * P(A) / P(B)
Expanded: P(A|B) = P(B|A) * P(A) / [P(B|A) * P(A) + P(B|not A) * P(not A)]
Use the expanded version when you don't know the probability of event B. The expanded version is often used to evaluate: -medical diagnostic tests -quality control tests -software tests Example: Evaluate the accuracy of a diagnostic test -1% of the population has the medical condition -If a person has the condition, there's a 95% chance that the test is positive -If a person does not have the condition, there's still a 2% chance that the test is positive. prior probability = the probability that a person has the medical condition (1%) posterior probability = the probability that the condition is present GIVEN that the test is positive (this is what you want to find) Event A = actually having the medical condition Event B = testing positive *Note: these events are different, as you can test positive and not have the condition. P(A): the probability that a person has the condition = 1% P(B|A): probability of testing positive given that the person has the condition = 95% P(B|not A): probability of testing positive given that the person does NOT have the condition = 2% Use the complement rule to determine the probability of not having the condition. P(A') = 1 - P(A): the probability that event A does not occur is 1 minus the probability that it does occur. P(A') = 1 - 0.01 = 0.99 = 99% So, there is a 99% probability that a person does not have the condition. Use the expanded version of Bayes' theorem, because you don't know the probability of event B (the probability that a person gets a positive test result). P(A|B) = (0.95 * 0.01) / (0.95 * 0.01 + 0.02 * 0.99) = 0.324 = 32.4% So, P(A|B), the probability that the condition is present given that the test is positive, is 32.4%. This is low because the condition is rare to begin with: most people don't have it, so a positive result is more often a false positive than a true one.
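The diagnostic-test calculation above can be sketched in Python (illustrative names; the numbers come from the card):

```python
# Expanded Bayes' theorem, used when P(B) is not known directly:
# P(A|B) = P(B|A)*P(A) / (P(B|A)*P(A) + P(B|not A)*P(not A))
p_condition = 0.01               # P(A): prior, 1% of the population
p_pos_given_condition = 0.95     # P(B|A): likelihood
p_pos_given_no_condition = 0.02  # P(B|not A): false-positive rate
p_no_condition = 1 - p_condition # complement rule: P(A') = 1 - P(A)

p_condition_given_pos = (p_pos_given_condition * p_condition) / (
    p_pos_given_condition * p_condition
    + p_pos_given_no_condition * p_no_condition
)
print(round(p_condition_given_pos, 3))  # 0.324, i.e., about 32.4%
```

The denominator is just P(B) rebuilt from its two pieces: positives among people with the condition plus positives among people without it.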
78
False Positive
Test result that indicates something is present when it's really not
79
False Negative
Test result that indicates something is not present when it really is
80
Probability Distribution
Describes the likelihood of the possible outcomes of a random event
81
Random variable
Represents the values for the possible outcomes of a random event 2 Types of Random variables: -Discrete random variables: has a countable number of possible values (e.g., whole numbers that can be counted, so the number of people in a room, etc.) -Continuous random variables: takes all the possible values in some range of numbers (e.g., decimal values like height, weight, etc.) General Rule of Thumb: -If you can count the number of outcomes, it's discrete (e.g., the number of times a coin flip results in heads) -If you measure the outcomes, it's continuous (e.g., a person's time in a marathon) Note: -Discrete distributions represent discrete random variables -Continuous distributions represent continuous random variables
82
Sample Space
The set of all possible values for a random variable
83
Discrete probability distributions
represent discrete random variables, or discrete events
84
Binomial Distribution
A discrete distribution that models the probability of events with only two possible outcomes: success or failure This definition assumes that -each event is independent (i.e., an event does not affect the probability of the other event/s) -the probability of success is the same for each event Note: You can label any outcome as a "success". For example, each coin toss has only two possible options: heads or tails. Either heads or tails could be labeled as a success based on the needs of your analysis. *But whatever label you apply to the outcomes, they MUST be mutually exclusive. Business use cases: Binomial distributions are used to model the probability that -a new medication generates side effects -a credit card transaction is fraudulent -a stock price rises or falls in value In machine learning, a binomial distribution is often used to classify data. Examples: train an algorithm to recognize whether an image is or is not a cat. Binomial Distribution represents a random event called a binomial experiment.
85
Mutually Exclusive
Two outcomes are mutually exclusive if they cannot occur at the same time
86
Binomial Experiment
A type of random experiment -The experiment consists of a number of repeated trials -Each trial has only two possible outcomes -The probability of success is the same for each trial -Each trial is independent Example 1 of a Binomial Experiment: Tossing a Coin 10 times in a row. -10 repeated coin tosses -two possible outcomes: heads or tails -the probability of success for each toss is the same: 50% -the outcome of any one coin toss does not affect the outcome of any other coin toss Example 2 of a Binomial experiment: You want to know how many customers return an item to a department store on a given day. 100 customers visit the store each day. 10% of all customers who visit the store make a return. You label a return as a success. This is a binomial experiment because -100 customer visits -2 possible outcomes: return or no return -the probability of success for each customer visit is the same: 10% -the outcome of one customer visit does not affect the outcome of any other customer visit
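The probability of a specific number of successes in a binomial experiment like the ones above follows the binomial formula, which can be sketched in Python with only the standard library (the function name is mine):

```python
from math import comb

# Binomial probability mass function:
# P(k successes in n trials) = C(n, k) * p**k * (1 - p)**(n - k)
def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 8 heads in 10 fair coin tosses
print(binomial_pmf(8, 10, 0.5))  # 45/1024, ~0.0439
```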
87
Random Experiment
a process whose outcome cannot be predicted with certainty. All random experiments have 3 things in common: -The experiment can have more than one possible outcome -you can represent each possible outcome in advance -the outcome of the experiment depends on chance
88
Random Experiment vs Binomial Experiment
Random Experiment: a process whose outcome cannot be predicted with certainty. All random experiments have 3 things in common: -The experiment can have more than one possible outcome -you can represent each possible outcome in advance -the outcome of the experiment depends on chance Binomial Experiment: A type of random experiment -The experiment consists of a number of repeated trials -Each trial has only two possible outcomes -The probability of success is the same for each trial -each trial is independent Example 1 of a Binomial Experiment: Tossing a Coin 10 times in a row. -10 repeated coin tosses -two possible outcomes: heads or tails -the probability of success for each toss is the same: 50% -the outcome of any one coin toss does not affect the outcome of any other coin toss Example 2 of a Binomial experiment: You want to know how many customers return an item to a department store on a given day. 100 customers visit the store each day. 10% of all customers who visit the store make a return. You label a return as a success. This is a binomial experiment because -100 customer visits -2 possible outcomes: return or no return -the probability of success for each customer visit is the same: 10% -the outcome of one customer visit does not affect the outcome of any other customer visit
89
different types of distributions help you __
model different types of data
90
Poisson Distribution
Models the probability that a certain number of events will occur during a specific time period Use Cases: You can use the Poisson distribution to model data such as: -calls per hour for a customer service call center -visitors per hour for a website -customers per day at a restaurant -severe storms per month in a city
91
Poisson Experiment
A type of random experiment. Always have the following attributes: -The number of events in the experiment can be counted -The mean number of events that occur during a specific time period is known -Each event is independent Example: The drive-through at a restaurant receives an average of 2 orders per minute. You want to determine the probability that the restaurant will receive a certain number of orders in a given minute. You can tell this is a Poisson experiment because it meets the above stated criteria. That is, -you can count the number of orders -there is an average of 2 orders per minute -the probability of one person placing an order does not affect the probability of another person placing an order
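The drive-through example above can be worked out with the Poisson formula, sketched here in Python (the function name is mine):

```python
from math import exp, factorial

# Poisson probability mass function:
# P(k events) = lam**k * e**(-lam) / k!
def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# The drive-through averages 2 orders per minute (lam = 2).
# Probability of exactly 3 orders in a given minute:
print(round(poisson_pmf(3, 2), 4))  # ~0.1804
```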
92
How do you know when something is a Poisson experiment (thus needing the Poisson distribution to model data) or a binomial experiment (thus needing the binomial distribution to model data)?
Poisson:
-Given: the avg probability of an event happening for a specific time period
-Want to find: the probability of a certain # of events happening in that time period
-Example: the prob of getting 12 calls between 2-3 pm

Binomial:
-Given: an exact probability of an event happening
-Want to find: the prob of the event happening a certain # of times in a repeated trial
-Example: the prob of getting 8 heads in 10 coin tosses (exact prob = 50%)
93
What are discrete distributions
The binomial and Poisson distributions (the Bernoulli distribution is also discrete)
94
probability distribution
probability distribution describes the likelihood of the possible outcomes of a random event.
94
uniform distribution
a uniform distribution describes events whose outcomes are all equally likely or have equal probability Example: Rolling a die can result in 6 outcomes: 1, 2, 3, 4, 5, or 6. The probability of each outcome is the same: 1/6 or about 16.7%
94
Bernoulli Distribution
Like the binomial distribution, it models events that have only two possible outcomes: success or failure. The only difference: the Bernoulli distribution refers to only a single trial of an experiment, while the binomial refers to repeated trials. Classic example of a Bernoulli: one coin toss.
95
normal distribution
Often called a bell curve, because of its shape. Often known as a Gaussian distribution. A continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped. The most common probability distribution in statistics, because so many data sets display a bell curve. Example 1: If you randomly sample 100 people, you'll discover a normal distribution of continuous variables like -height -weight -blood pressure -IQ scores -salaries Example 2: Standardized tests. The majority of people will score close to the average/mean score. Fewer numbers of people will score above or below average. Normal distributions have the following features: -the shape is a bell curve -the mean is located at the center of the curve -the curve is symmetrical on both sides of the center -the total area under the curve equals 1
96
standard deviation
calculates the typical distance of a data point from the mean of a dataset
97
the empirical rule
-68% of values fall within 1 standard deviation of the mean -95% of values fall within 2 standard deviations of the mean -99.7% of values fall within 3 standard deviations of the mean The empirical rule can give you an idea of how values in your dataset are distributed (e.g., what percentage of values will fall within 1, 2, or 3 standard deviations of the mean). It's also helpful for detecting outliers. (Typically, values that lie more than 3 standard deviations above or below the mean are considered outliers.) You need to detect outliers because some values may be due to errors in data collection or data processing. These false values may skew the results of your analysis. Note: standard deviation: calculates the typical distance of a data point from the mean of a dataset
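The 68-95-99.7 figures above can be checked against the standard normal distribution using Python's standard library (a minimal sketch):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist()

# Probability of falling within 1, 2, and 3 standard deviations of the mean
within_1 = z.cdf(1) - z.cdf(-1)
within_2 = z.cdf(2) - z.cdf(-2)
within_3 = z.cdf(3) - z.cdf(-3)

print(round(within_1, 4))  # ~0.6827 (about 68%)
print(round(within_2, 4))  # ~0.9545 (about 95%)
print(round(within_3, 4))  # ~0.9973 (about 99.7%)
```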
98
continuous probability distributions
Continuous probability distributions represent continuous random variables, which can take on all the possible values in a range of numbers. Typically, these are decimal values that can be measured, such as height, weight, time, or temperature. (Note: the binomial, Bernoulli, and Poisson distributions are discrete, not continuous.) For example, you can keep on measuring time with more accuracy: 1.1 seconds, 1.12 seconds, 1.1257 seconds, and so on. Because there are infinite values that X could assume, the probability of X taking on any one specific value is zero. Therefore we often speak in ranges of values, e.g., P(X > 0) = 0.50. The normal distribution is one example of a continuous distribution. P(X > 0) = 0.50 means the probability that X is greater than 0 equals 50%. *Why 50%? Because the standard normal distribution is centered at zero (the bell curve), so half the distribution is greater than 0 and the other half is less than 0. And because the normal distribution is a continuous distribution, we cannot calculate an exact probability for a single outcome; instead we calculate a probability for a range of outcomes (for example, the probability that a random variable X is greater than 10). Example: Let's say we want to calculate the probability that z is between -1 and 1. To do so, first look up the probability that z is less than negative one: P(z < -1) = 0.1587. (This means the probability that z is less than -1 is 0.1587.) Because the normal distribution is symmetric, we therefore know that the probability that z is greater than one also equals 0.1587: P(z > 1) = 0.1587. To calculate the probability that z falls between -1 and 1, take 1 - 2(0.1587) = 0.6826. Because 0.1587 * 2 represents the probability that z falls below -1 or above 1, 1 - (0.1587 * 2) is the probability that z is between -1 and 1. Visual: left tail: 0.1587; middle: unknown and what we want; right tail: 0.1587. The total probability must equal 1.
We know that the left and right tails together = 0.1587 * 2, ergo 1 - (0.1587 * 2) = the middle.
99
probability density vs probability
Probability: the actual chance of an event. Probability answers questions like: What is the probability a die lands on 4? → 1/6 What is the probability a person is between 5'6" and 5'8"? → maybe 0.18 Probability is always between 0 and 1 (0 ≤ P ≤ 1). It represents the chance that something happens. Probability density: the concentration of probability. Probability density tells you how tightly probability is packed around a specific value. It is NOT itself a probability. Instead, it helps you calculate probability over a range. Analogy: population density vs population. This is the best real-world analogy. Population density: 100 people per square mile. Population: depends on how much area you look at. Example: Density = 100 people/sq mile, Area = 5 sq miles, Population = 100 × 5 = 500. Same idea: probability density ↔ population density; probability ↔ total population; range width ↔ area size.
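The density-times-width idea above can be demonstrated numerically with the standard normal distribution (a sketch; over a narrow range, density × width approximates the probability, just as population density × area approximates population):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Density at a point is NOT a probability
density_at_0 = z.pdf(0)  # ~0.3989

# Approximate P(-0.01 < z < 0.01) as density * range width
approx = density_at_0 * 0.02
exact = z.cdf(0.01) - z.cdf(-0.01)
print(round(approx, 5), round(exact, 5))  # nearly identical
```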
100
z-score
A measure of how many standard deviations below or above the population mean a data point is. Also called standard scores, as they're based on the standard normal distribution, which has a mean of zero and a standard deviation of 1. z-scores typically range from -3 to 3. Examples: -The z-score is 0 if the value is equal to the mean -The z-score is positive if the value is greater than the mean -The z-score is negative if the value is less than the mean -A z-score of 1.5 is 1.5 standard deviations above the mean -A z-score of -2 is 2 standard deviations below the mean z = (x - mu) / sigma z: z-score x: raw score or single data value mu (pronounced "mew"): the population mean sigma: the population's standard deviation Example: You take a standardized test. You score 133. The test has a mean score of 100 and a standard deviation of 15. z = (133 - 100) / 15 z = 2.2 So, your test score is 2.2 standard deviations above the mean. z-scores are useful because they give us an idea of how an individual value compares to the rest of the distribution.
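The test-score example above can be sketched in Python (the function name is mine):

```python
# z-score: how many standard deviations a value is from the mean
# z = (x - mu) / sigma
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Test score of 133, with mean 100 and standard deviation 15
print(z_score(133, 100, 15))  # 2.2
```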
101
standardization
The process of putting different variables on the same scale
102
Sampling
The process of selecting a subset of data from a population
103
In stats, population can refer to:
any type of data, including: people, organizations, objects, events, measurements, and more For instance, a population might be the set of: -All students at a university -All the cell phones ever manufactured by a company -All the forests on Earth
104
Representative Sample
Accurately reflects the characteristics of a population
105
example questions answered by sampling:
-How many products in an app store do we need to test to feel confident that all the products are secure from malware? -How do we select a sample of users to run an effective A/B test for an online retail store? -How do we select a sample of customers of a video streaming service to get reliable feedback on the shows they watch?
106
Sampling is useful, because
-a sample requires less time than a full population -a sample saves money and resources -a sample is more practical than analyzing an entire population
107
If your predictive model is based on a bad sample, then your predictions__
will not be accurate. Ultimately, the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about a population, make sure your sample is representative of the population.
108
The sampling process
Step 1: Identify the target population (e.g., individuals 18 years and older who are eligible to vote in x city) Step 2: Select the sampling frame (a list of individuals who meet the criteria of the target population, starting with A. Adams and ending with Z. Zenya) Step 3: Choose the sampling method Step 4: Determine the sample size Step 5: Collect the sample data Example 2: you’re a data professional working for a company that manufactures home appliances. The company wants to find out how customers feel about the innovative digital features on their newest refrigerator model. The refrigerator has been on the market for two years and 10,000 people have purchased it. Your manager asks you to conduct a customer satisfaction survey and share the results with stakeholders. Step 1: Identify the target population: the 10,000 customers who purchased the company’s newest refrigerator model. Step 2: Create a sampling frame: an alphabetical list of the names of all these customers ** Note: Ideally, your sampling frame should include the entire target population. However, for practical reasons, your sampling frame may not exactly match your target population, because you may not have access to every member of the population For instance, the company’s customer database may be incomplete, or contain data processing errors. Or, some customers may have changed their contact information since their purchase, and you may be unable to locate or contact them. Furthermore, sometimes the sampling frame might include elements outside of the target population simply by accident or because it is impossible to know the target population with certainty. ** Step 3: Choose the sampling method (i.e., probability sampling or non-probability sampling). Step 4: Determine the sample size, since you don’t have the resources to survey everyone in your sampling frame. ** Note: In general, the larger the sample size, the more precise your predictions. 
However, using larger samples typically requires more resources. The sample size you choose depends on various factors, including the sampling method, the size and complexity of the target population, the limits of your resources, your timeline, and the goal of your research. Based on these factors, you can decide how many customers to include in your sample. ** Step 5: Collect the data: You give a customer satisfaction survey to the customers selected for your sample. The survey responses provide useful data on how customers feel about the digital features of the refrigerator. Then, you share your results with stakeholders to help them make more informed decisions about whether to continue to invest in these features for future versions of this refrigerator, and develop similar features for other models.
109
Target Population
The complete set of elements that you're interested in knowing more about
110
Sampling Frame
A list of all the items in your target population
111
Sampling Methods (basic, not super specific, definition)
Probability Sampling: Uses random selection to generate a sample. (Each member of the population has an equal chance of being selected, which gives you the best chance of a representative sample.) Non-Probability Sampling: Based on convenience or personal preference.
112
Sample Size
The number of individuals or items chosen for a study or experiment
113
Probability Sampling Methods
-Simple random sampling -Stratified random sampling -Cluster random sampling -Systematic random sampling (These are all based on random selection, which is the preferred method for accurately representing a population and reducing bias.)
114
Simple Random Sample
A probability sampling method. Every member of a population is selected randomly and has an equal chance of being chosen. You randomly select members using a random number generator or by another method of random selection. Example: There are 1,000 people at your company. You assign a number to each person in your database. Then you use a random number generator to select 100 people. Pros: They tend to be fairly representative and tend to avoid bias since every member of the population has an equal chance of being selected. Cons: It's expensive and time-consuming to conduct large simple random sampling
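The 1,000-person example above can be sketched in Python (a minimal illustration; the population and seed are hypothetical):

```python
import random

random.seed(42)  # hypothetical seed, for reproducibility

# Assign a number to each of the 1,000 people in the company
population = list(range(1, 1001))

# Randomly select 100 people; every member has an equal chance
sample = random.sample(population, k=100)

print(len(sample))       # 100 people selected
print(len(set(sample)))  # 100 -- no one is picked twice (without replacement)
```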
115
Stratified Random Sample
A probability sampling method. Divide a population into groups and randomly select some members from each group to be in the sample. These groups are called strata. Strata can be organized by age, gender, income, or whatever category you wish to study. Example: You want to conduct a study on how much time high school students spend studying on weekends. You could divide the student pop by age (14, 15, 16, and 17). Then survey an equal number of students from each age group. Pros: Helps ensure that members from each group (i.e., strata) are included. Cons: It can be hard to develop strata if you lack knowledge about the pop to be studied. Example: If you don't know how relevant job title, industry, etc. are to median income (and you're doing a study on just that), it will be difficult to choose the best category (read: strata).
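The high-school example can be sketched with pandas (the dataset is hypothetical; `groupby(...).sample` draws randomly within each stratum):

```python
import pandas as pd

# Hypothetical student population: 100 students per age group
students = pd.DataFrame({
    "student_id": range(400),
    "age": [14, 15, 16, 17] * 100,
})

# Stratify by age, then randomly sample an equal number from each stratum
strata_sample = students.groupby("age", group_keys=False).sample(n=10, random_state=0)

print(strata_sample.groupby("age").size())  # 10 students per age group
```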
116
Cluster Random Sample
A probability sampling method. Divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample. Clusters are created based on identifying details (e.g., age, gender, location, etc.). Example: You want to conduct a survey of employees at a global company using this method. The company has ten offices in ten cities around the world. Each office has about the same number of employees with similar roles. You randomly select 3 offices in 3 cities as clusters. You include all the employees at the 3 offices. Pros: -Gets every member from a particular cluster, which is useful when each cluster reflects the population as a whole. -Helpful when dealing with large and diverse populations that have clearly defined subgroups. Example: If researchers want to learn about the preferences of primary school students in Oslo, Norway, they can use one school as representative of all schools in the city. Con: It may be difficult to create clusters that accurately reflect the overall population. Example: You may only have access to offices in the United States vs the whole world, and employees in the US may have different characteristics and values.
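The global-office example can be sketched as follows (office names, sizes, and the seed are hypothetical):

```python
import random

random.seed(1)  # hypothetical seed

# Ten offices of roughly equal size, each a potential cluster
offices = ["Oslo", "Tokyo", "Lagos", "Lima", "Austin",
           "Paris", "Mumbai", "Seoul", "Cairo", "Sydney"]
employees = {city: [f"{city}-{i}" for i in range(50)] for city in offices}

# Randomly choose 3 clusters, then include EVERY employee in those clusters
chosen = random.sample(offices, k=3)
cluster_sample = [person for city in chosen for person in employees[city]]

print(len(cluster_sample))  # 150 = 3 offices x 50 employees each
```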
117
Systematic Random Sample
A probability sampling method. Put every member of a population into an ordered sequence. Then, you choose a random starting point in the sequence and select members for your sample at regular intervals. Pro: -Often representative of a pop since each member has an equal chance of being represented -Quick and convenient when you have a complete list of the population Con: -You need to know the size of the pop that you want to study before you begin. (If you don't have this info, it's difficult to choose consistent intervals).
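The interval logic can be sketched in a few lines (the population size, sample size, and seed are hypothetical):

```python
import random

random.seed(7)  # hypothetical seed

population = list(range(1, 1001))  # the ordered sequence of 1,000 members
n = 100                            # desired sample size
interval = len(population) // n    # take every 10th member

start = random.randrange(interval)  # random starting point within the first interval
sample = population[start::interval]

print(len(sample))  # 100 members, evenly spaced through the sequence
```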
118
Sampling Bias
When a sample is not representative of the population as a whole
119
Types of sampling methods
Probability sampling methods use random selection, which helps avoid sampling bias. Non-probability sampling methods do not use random selection. These often result in biased samples. (The sample is often not representative of the population as a whole.)
120
Why is non-probability sampling used?
It's often less expensive and more convenient for researchers to conduct. Issues: It often results in biased samples. (The sample is often not representative of the population as a whole.)
121
Non-probability methods
-convenience sampling -voluntary response sampling -snowball sampling -purposive sampling
122
Convenience Sampling
Non-probability sampling. Choose members of a population that are easy to contact or reach (e.g., your workplace, school, or a public park). Example: To conduct an opinion poll, a researcher may stand in front of a local high school during the day to poll the people who happen to walk by. Cons: Convenience sampling often shows undercoverage bias. (In the above example, people who don't work at or attend the school will not be represented in the sample.) ** Undercoverage bias: When some members of a population are inadequately represented in a sample.
123
Undercoverage Bias
When some members of a population are inadequately represented in a sample.
124
Voluntary Response Sampling
Non-probability sampling. Consists of members of a population who volunteer to participate in a study. Example: Restaurant owners want to know how their customers feel about their dinner options. They ask their regular customers to take an online survey about the quality of the restaurant's food. Cons: Voluntary response sampling tends to suffer from nonresponse bias. Note: People who voluntarily respond are likely to have stronger opinions (either positive or negative) than the rest of the population. This makes the volunteer respondents at the restaurant in the above example an unrepresentative sample. ** Nonresponse Bias: When certain groups of people are less likely to provide responses.
125
Snowball Sampling
Non-probability sampling. Researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. (Like a snowball, the sample gets bigger and bigger as more participants join in.) This type of recruiting can result in sampling bias. Since initial participants will recruit additional participants on their own, it's likely that they'll share similar characteristics. And these characteristics may not be representative of the total pop being studied.
126
Purposive Sampling
Non-probability sampling. Researchers select participants based on the purpose of their study. Applicants who do not fit the profile are rejected. This can lead to biased outcomes, because the individuals in the sample are not representative of the population as a whole. Example: Researcher wants to survey students on the efficacy of certain teaching methods at their university. The researcher only includes the students who regularly attend class and have an established record of academic achievement. They select the students with the highest grade point averages. Issue: biased outcome. (see above note.)
127
Nonresponse Bias
When certain groups of people are less likely to provide responses.
128
When is non-probability sampling useful?
Non-probability sampling is useful for collecting data in situations where you have limited time, budget, and other resources. Non-probability sampling is also useful for exploratory research, when you want to get an initial understanding of a population, rather than make inferences about the population as a whole. However, it’s important to remember that non-probability sampling methods have a high risk of sampling bias.
129
Statistic vs parameter
Statistic: A characteristic of a sample Parameter: A characteristic of a population Example: -The mean weight of a random sample of 100 penguins is a statistic -The mean weight of the total population of 10,000 penguins is a parameter
130
Point Estimate
Uses a single value to estimate a population parameter Example: A data professional may use the mean weight of 100 penguins to estimate the mean weight of the population (of penguins). This is an example of using a single value to estimate a population parameter. Note: Parameter: A characteristic of a population -The mean weight of the total population of 10,000 penguins is a parameter (-The mean weight of a random sample of 100 penguins is a statistic)
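The penguin example as a tiny simulation (the population is hypothetical, generated just for illustration):

```python
import random
import statistics

random.seed(5)  # hypothetical seed

# Hypothetical population: 10,000 penguin weights (lbs)
population = [random.gauss(3.1, 0.4) for _ in range(10_000)]

# Point estimate: a single value (the mean of a random sample of 100)
# stands in for the unknown population mean
sample = random.sample(population, 100)
point_estimate = statistics.mean(sample)

print(round(point_estimate, 2))  # close to the population mean of ~3.1
```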
131
Sampling distribution
A probability distribution of a sample statistic Let's say you take repeated samples of the same size from a population. Since each sample is random, the mean value will vary from sample to sample in a way that can't be predicted. Notes: Probability Distribution: possible outcomes of a random variable like a coin toss or a die roll. Statistic: A characteristic of a sample (e.g., mean weight of a random sample of 100 penguins). Example of sampling distribution: Pop of 10,000 penguins. You take repeated samples of the same size from this population, looking at weight. Since each sample is random, the mean value will vary from sample to sample in a way that can't be predicted. First sample of 10 penguins, mean weight: 3.1 lbs Second sample of 10 penguins, mean weight: 2.8 lbs Third sample of 10 penguins, mean weight: 2.9 lbs Each time you take a sample, you'll get closer to the population mean. (You can get outliers, e.g., a sample of larger than average penguins or smaller than average penguins)
132
Sampling variability
how much an estimate varies between samples (You can use a sampling distribution to represent the frequency of all your different sample means.)
133
As you increase the size of a sample, the mean weight of your sample data will get closer to___. If you sampled the entire population, your sample mean would be the same as ____. But to get an accurate estimate of the population mean, you don't need to sample the entire population. If you take a large enough sample from a population (e.g., ____, your sample mean will be___.
As you increase the size of a sample, the mean weight of your sample data will get closer to the mean weight of the population. If you sampled the entire population, your sample mean would be the same as your population mean. But to get an accurate estimate of the population mean, you don't need to sample the entire population. If you take a large enough sample from a population (e.g., 100 from 10,000), your sample mean will be an accurate estimate of the population mean.
134
If your sample is large enough, your sample mean will roughly equate to the ___
Population mean Example: Sample of 100 penguins, mean weight: 3 lbs. So, your best estimate for the entire population is 3 lbs. The population mean, in this example, is 3.1 lbs.
135
The more variability in your sample data, the less likely the sample mean is_____. Data professionals use the standard deviation of the sample means to measure this variability.
The more variability in your sample data, the less likely the sample mean is an accurate estimate of the population mean. Example: Population mean for blue penguins: 3.1 lbs -sample mean 1: 3.3 lbs -sample mean 2: 2.8 lbs -sample mean 3: 2.4 lbs Note: Standard deviation measures the variability of your data (i.e., how spread out your values are). The more spread between the data values, the larger the standard deviation.
136
When reviewing sample means, a larger standard error means ______ and a smaller standard error means ______
Larger standard error = sample means are more spread out Smaller standard error = sample means are closer together
137
Keep in mind that the concept of standard error assumes ____ In reality, researchers usually work with a single sample. It's often too complicated, expensive, or time-consuming to take repeated samples of a population. Instead, statisticians have _____
Keep in mind that the concept of standard error assumes repeated sampling. In reality, researchers usually work with a single sample. It's often too complicated, expensive, or time-consuming to take repeated samples of a population. Instead, statisticians have derived a formula for calculating standard error based on the mathematical assumption of repeated sampling. Standard error of the mean: s/√n s: sample standard deviation n: sample size Example: Sample of 100 blue penguins has a mean weight of 3 lbs and a standard deviation of 1 lb. =s/√n =1/√100 =0.1 lbs So, 0.1 is the standard error of the mean
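The penguin calculation in the card can be written out directly:

```python
import math

s = 1.0   # sample standard deviation (1 lb)
n = 100   # sample size (100 penguins)

standard_error = s / math.sqrt(n)  # s / sqrt(n)
print(standard_error)  # 0.1 lbs
```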
138
As the sample size gets larger, the standard error gets ___
smaller
139
The central limit theorem can be used to estimate (give a few examples):
-the mean annual household income for an entire city or country -the mean height and weight for an entire animal or plant population -the mean commute time for all employees of a large corporation
140
Central Limit Theorem
As the sample size increases, your sampling distribution assumes the shape of a bell curve. AND If you take a large enough sample, the sample mean will be roughly equal to the population mean. This holds true for ANY population. You don't need to know the shape of your population's distribution (i.e., right-skewed, left-skewed, etc.) in advance to apply the theorem. If you collect a large enough sample, the shape of your sampling distribution will follow a normal distribution. Note: There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply, but in general, a sample size of thirty or more is considered sufficient. Exploratory Data Analysis (EDA) can help you determine how large of a sample size is necessary for a given dataset.
141
In order to apply the central limit theorem, the following conditions must be met:
-Randomization: Your sample data must be the result of random selection. Random selection means that every member in the population has an equal chance of being chosen for the sample. -Independence: Your sample values must be independent of each other. Independence means that the value of one observation does not affect the value of another observation. Typically, if you know that the individuals or items in your dataset were selected randomly, you can also assume independence. -10%: To help ensure that the condition of independence is met, your sample size should be no larger than 10% of the total population when the sample is drawn without replacement (which is usually the case). (Note: In general, you can sample with or without replacement. When a population element can be selected only one time, you are sampling without replacement. When a population element can be selected more than one time, you are sampling with replacement.) There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply. The answer depends on the following factors: -Requirements for precision. The larger the sample size, the more closely your sampling distribution will resemble a normal distribution, and the more precise your estimate of the population mean will be. -The shape of the population. If your population distribution is roughly bell-shaped and already resembles a normal distribution, the sampling distribution of the sample mean will be close to a normal distribution even with a small sample size. In general, many statisticians and data professionals consider a sample size of 30 to be sufficient when the population distribution is roughly bell-shaped, or approximately normal. However, if the original population is not normal—for example, if it’s extremely skewed or has lots of outliers—data professionals often prefer the sample size to be a bit larger. 
Exploratory data analysis can help you determine how large of a sample is necessary for a given dataset.
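The theorem can be seen in a quick simulation (a hypothetical, right-skewed population; standard library only):

```python
import random
import statistics

random.seed(0)  # hypothetical seed

# A clearly non-normal (right-skewed) population of 100,000 values
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# Take 2,000 random samples of size 30 and record each sample mean
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# Despite the skewed population, the sample means center on the population mean
print(round(statistics.mean(sample_means) - pop_mean, 2))
```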
142
Population Proportion
The percentage of individuals or elements in a population that share a certain characteristic
143
As sample size gets larger, standard error gets __
smaller
144
Sampling Distribution
A probability distribution of a sample statistic. Notes: Probability Distribution: represents the possible outcomes of a random variable, such as a coin toss or a die roll Sample statistics are based on randomly sampled data, and their outcome cannot be predicted with certainty.
145
Estimating population parameters through sampling is a powerful form of statistical inference.
Sampling distributions describe the uncertainty associated with a sample statistic, and help you make proper statistical inferences. This is important because stakeholder decisions are often based on the estimates you provide.
146
sampling with replacement
When a population element can be selected more than one time. Steps for Sampling with Replacement 1. Select an item randomly from the population. 2. Record the selected item. 3. Return the item to the population. 4. Repeat the process until the desired sample size is achieved. Use cases: Bootstrapping, Monte Carlo methods Notes: Bootstrapping: A resampling method in which samples are drawn with replacement from observed data to estimate the distribution of a statistic. Monte Carlo Simulation: Used in simulations where different scenarios require random samples drawn with replacement to model possible outcomes.
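The bootstrapping use case can be sketched as follows (the observed data and seed are hypothetical):

```python
import random
import statistics

random.seed(3)  # hypothetical seed

# One observed sample, e.g., 20 penguin weights in lbs
observed = [random.gauss(3.1, 0.5) for _ in range(20)]

# Bootstrap: resample WITH replacement (random.choices) many times
# to estimate how much the sample mean varies
boot_means = [statistics.mean(random.choices(observed, k=len(observed)))
              for _ in range(1_000)]

# The spread of the bootstrap means estimates the standard error
print(round(statistics.stdev(boot_means), 2))
```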
147
sampling without replacement
When a population element can be selected only one time Steps for Sampling without Replacement 1. Select an item randomly from the population. 2. Record the selected item. 3. Remove the selected item from the population. 4. Repeat the process until the desired sample size is achieved Use case: Lottery draws, survey sampling Notes: Lottery Draws: Drawing lottery numbers without replacement ensures that no number can appear more than once. Survey Sampling: Selecting participants for a survey where no individual can be chosen more than once.
148
Random Seed
A starting point for generating random numbers
149
pandas .sample()
Draws a random sample of rows from a DataFrame. Example (sample 50 rows with replacement, with a random seed for reproducibility): sampled_data = epa_data.sample(n=50, replace=True, random_state=42) sampled_data
150
Standard Error
Standard Error: The standard deviation of the sample means Example: We weighed 5 mice. We find the mean and the standard deviation (how spread out the data is). We do this experiment 6 times, so we have 6 means and 6 standard deviations. You can then calculate the mean of the 6 means and the standard deviation of the 6 means. The standard deviation of the means is the standard error.
151
Confidence Interval vs Confidence Level
Confidence Interval: A range of values that describes the uncertainty surrounding an estimate. Technically, 95% confidence means that if you take repeated random samples from a population, and construct a confidence interval for each sample using the same method, you can expect that 95% of these intervals will capture the population mean. You can also expect that 5% of the total will not capture the population mean. Use cases: A data professional may use a confidence interval to describe the uncertainty of an estimate for: -the average return on an investment for a stock portfolio -the average maintenance costs for factory machinery -the percentage of customers who will register for a rewards program -the percentage of website visitors who will click on an ad Confidence Level: The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling.
152
Frequentist vs Bayesian
Frequentist: Treats a population parameter as a fixed, unknown value; probability statements describe the long-run behavior of repeated sampling (e.g., confidence interval). Bayesian: Treats the parameter itself as a random variable with a probability distribution that is updated as data is observed (e.g., credible interval).
153
Point Estimate
Uses a single value to estimate a population parameter
154
Interval Estimate
Uses a range of values to estimate a population parameter
155
Population Parameter vs Statistic
Population Parameter: A number describing the whole population (e.g., the population mean) vs Statistic: A number describing the sample (e.g., the sample mean) Ex// Pop Parameter: Proportion of all US residents that support the death penalty Sample Statistic: Proportion of 2000 randomly sampled participants that support the death penalty. Pop parameter: Median income of all college students in Massachusetts Sample Statistic: Median income of 850 college students in Boston and Wellesley. Pop parameter: Standard deviation of weights of all avocados in the region. Sample Statistic: Standard deviation of weights of avocados from one farm
156
What does a confidence interval include?
-Sample statistic -Margin of error -Confidence level Examples: Sample mean of our sample of penguins is 30 lbs. Notes: -Population Parameter: A number describing the whole population (e.g., the population mean) -Statistic: A number describing the sample (e.g., the sample mean) -Margin of error: The max expected difference between a pop parameter and a sample estimate. -Pop parameter: Standard deviation of weights of all avocados in the region. -Sample Statistic: Standard deviation of weights of avocados from one farm Confidence Interval: sample statistic +/- margin of error -Confidence Level: The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling.
157
What does a 95% confidence interval mean? What does it not mean?
It means: -95% of the intervals capture the population mean -5% of the intervals do not capture the population mean. Technically, 95% confidence means that if you take repeated random samples from a population, and construct a confidence interval for each sample using the same method, you can expect that 95% of these intervals will capture the population mean. You can also expect that 5% of the total will not capture the population mean. The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling. Pro tip: Remember that a 95% confidence level refers to the success rate of the estimation process. ~~~ Note: a confidence interval includes: -Sample statistic -Margin of error -Confidence level ~~~ Misinterpretation 1: "There is a 95% probability that the [population parameter] is between [L] and [U]." Remember that when we're constructing a confidence interval, we are estimating a population parameter when we only have data from a sample. We don't know if our sample statistic is less than, greater than, or approximately equal to the population parameter. And we don't know for sure whether our particular confidence interval contains the population parameter or not; the 95% refers to the long-run success rate of the method, not to any single interval. Misinterpretation 2: 95% refers to the percentage of data values that fall within the interval. A 95% confidence interval shows a range of values that likely includes the actual population mean. This is not the same as a range that contains 95% of the data values in the population. Example: Correlation Between Height and Weight At the beginning of the Spring 2017 semester, a sample of World Campus students were surveyed and asked for their height and weight. In the sample, Pearson's r = 0.487. A 95% confidence interval was computed of [0.410, 0.559]. Interpretation: The correct interpretation of this confidence interval is that we are 95% confident that the correlation between height and weight in the population of all World Campus students is between 0.410 and 0.559.
Example: Seatbelt Usage A sample of 12th grade females was surveyed about their seatbelt usage. A 95% confidence interval for the proportion of all 12th grade females who always wear their seatbelt was computed to be [0.612, 0.668]. Interpretation: The correct interpretation of this confidence interval is that we are 95% confident that the proportion of all 12th grade females who always wear their seatbelt in the population is between 0.612 and 0.668.
158
if you're working with a small sample size, and your data is approximately normally distributed, you should use the...
if you're working with a small sample size, and your data is approximately normally distributed, you should use the t-distribution rather than the standard normal distribution. For a t-distribution, you use t-scores to make calculations about your data. The graph of the t-distribution has a bell shape that is similar to the standard normal distribution. But, the t-distribution has bigger tails than the standard normal distribution does. The bigger tails indicate the higher frequency of outliers that come with a small dataset. As the sample size increases, the t-distribution approaches the normal distribution. When the sample size reaches 30, the distributions are practically the same, and you can use the normal distribution for your calculations.
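A t-based confidence interval for a small sample can be sketched as follows (the data are hypothetical; 2.262 is the t critical value for 95% confidence with 9 degrees of freedom):

```python
import math
import statistics

# Small sample (n = 10) of roughly normal data, e.g., penguin weights in lbs
weights = [2.9, 3.1, 3.0, 3.3, 2.8, 3.2, 3.0, 2.7, 3.1, 3.4]

n = len(weights)
mean = statistics.mean(weights)
se = statistics.stdev(weights) / math.sqrt(n)  # standard error: s / sqrt(n)

t_crit = 2.262  # t critical value, 95% confidence, df = n - 1 = 9

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```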
159
Hypothesis Testing
A statistical procedure that uses sample data to evaluate an assumption about a population parameter
160
Statistical Significance
The claim that the results of a test or experiment are not explainable by chance alone If a result is statistically significant, that means it’s unlikely to be explained solely by chance or random factors.
161
Steps for performing a hypothesis test
1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value 4. Reject or fail to reject the null hypothesis Example: You want to determine if a coin is fair (i.e., confirm that it is not weighted to favor one side). You'll flip the coin 6 times and record the results. 1. Null Hypothesis: The coin is fair. Alternative Hypothesis: The coin is not fair. 2. Significance level = 5% 3. P-value = 1.56%: the probability of a fair coin landing on tails 6 times in a row is 0.50 * 0.50 * 0.50 * 0.50 * 0.50 * 0.50 = 0.0156, or 1.56%, because each flip of a fair coin has a 50% chance of landing on tails. Because the p-value (1.56%) is less than the significance level (5%), the result would be unlikely if the coin were fair, which is evidence for the alternative hypothesis (i.e., the coin is not fair). 4. Decide whether to reject or fail to reject the null hypothesis (statisticians never say "accept," just "fail to reject," because "accept" would suggest a certainty you never have with probability). If p-value < significance level: reject the null hypothesis. If p-value > significance level: fail to reject the null hypothesis. Here, 1.56% < 5%, so you reject the null hypothesis.
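Step 3 of the coin example is a one-line calculation:

```python
# Probability of 6 tails in a row from a fair coin: 0.5^6
p_value = 0.5 ** 6
alpha = 0.05  # significance level

print(round(p_value, 4))  # 0.0156
print(p_value < alpha)    # True -> reject the null hypothesis "the coin is fair"
```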
162
Null Hypothesis
A statement that is assumed to be true unless there is convincing evidence to the contrary The null hypothesis typically assumes that the observed data occurs by chance. Notes: -The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample. -In statistics, the null hypothesis is often abbreviated as H sub zero (H0). -Null hypotheses often include phrases such as “no effect,” “no difference,” “no relationship,” or “no change.” -When written in mathematical terms, the null hypothesis always includes an equality symbol (usually =, but sometimes ≤ or ≥). *why? The null hypothesis represents the "boring" default assumption — that nothing interesting is happening. For example: "this drug has no effect," or "these two groups are identical." Equality symbols fit perfectly here because you're essentially saying things are the same, unchanged, or at baseline. Rule of thumb: Typically, the null hypothesis represents the status quo, or the current state of things. The null hypothesis assumes that the status quo hasn’t changed. Example#1: Mean weight An organic food company is famous for their granola. The company claims each bag they produce contains 300 grams of granola—no more and no less. To test this claim, a quality control expert measures the weight of a random sample of 40 bags. H0: μ = 300 (the mean weight of all produced granola bags is equal to 300 grams) Ha: μ ≠ 300 (the mean weight of all produced granola bags is not equal to 300 grams) Note: μ (pronounced "mew") is just the symbol for the mean (average) of a population.
163
Alternative Hypothesis
A statement that contradicts the null hypothesis and is accepted as true only if there is convincing evidence for it The alternative hypothesis typically assumes that the observed data does not occur by chance. Notes: -The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample. -In statistics, the alternative hypothesis is often abbreviated as H sub a (Ha). Alternative hypotheses often include phrases such as “an effect,” “a difference,” “a relationship,” or “a change.” -When written in mathematical terms, the alternative hypothesis always includes an inequality symbol (usually ≠, but sometimes < or >). *why? The alternative hypothesis is what you're trying to find evidence for — that something is happening, that there's a difference, a change, an effect. You can't really put an equals sign on that, because you're saying things are not the same. Rule of Thumb: The alternative hypothesis does NOT assume that the status quo hasn't changed; instead, it suggests a new possibility or different explanation. Example#1: Mean weight An organic food company is famous for their granola. The company claims each bag they produce contains 300 grams of granola—no more and no less. To test this claim, a quality control expert measures the weight of a random sample of 40 bags. H0: μ = 300 (the mean weight of all produced granola bags is equal to 300 grams) Ha: μ ≠ 300 (the mean weight of all produced granola bags is not equal to 300 grams) Note: μ (pronounced "mew") is just the symbol for the mean (average) of a population.
164
Significance Level
Also known as alpha (α): the threshold at which you will consider a result statistically significant (i.e., if the p-value is less than the threshold, you reject the null hypothesis). By convention, data professionals set the significance level at 0.05, or 5%. Other common choices are 1% and 10%. You can adjust the significance level to meet the specific requirements of your analysis. A lower significance level means an effect has to be larger to be considered statistically significant. A significance level of 5% means you are willing to accept a 5% chance of rejecting a null hypothesis that is actually true.
165
P-Value
The p-value, or probability value, is the probability of observing results at least as extreme as those observed, assuming the null hypothesis is true. It tells you the statistical significance of a finding. Example: Null Hypothesis: The coin is fair. Alternative Hypothesis: The coin is not fair. P-value = 1.56%: the probability of a fair coin landing on tails 6 times in a row is 0.50 * 0.50 * 0.50 * 0.50 * 0.50 * 0.50 = 0.0156, or 1.56%, because each flip of a fair coin has a 50% chance of landing on tails. Because this p-value is less than the 5% significance level, you reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis (i.e., the stronger the evidence that the coin is not fair). ** Rule of Thumb: A low p-value indicates high statistical significance (meaning you reject the null hypothesis), while a high p-value indicates low or no statistical significance (meaning you fail to reject the null hypothesis). **
166
Types of Errors in Hypothesis Testing
Type 1 Error (False Positive): The rejection of a null hypothesis that is actually true. (That is, you conclude that your result is statistically significant when it actually occurred by chance.) To reduce your chance of a type 1 error, choose a lower significance level. (A significance level of 5% means you are willing to accept a 5% chance of rejecting a null hypothesis that is actually true.) Type 2 Error (False Negative): The failure to reject a null hypothesis that is actually false. (That is, you conclude that your result occurred by chance when it is actually statistically significant.) Note: Choosing a lower significance level makes a type 2 error more likely.
167
μ
μ (pronounced "mew") is just the symbol for the mean (average) of a population.
168
Example Problem: A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead of 3 times), their recovery period will be longer. Average recovery times for knee surgery patients is 8.2 weeks. What is the hypothesis written mathematically? What is the null hypothesis? What is the alternative hypothesis?
The hypothesis written mathematically (this is the alternative hypothesis): H1: μ > 8.2 Null hypothesis: H0: μ ≤ 8.2 (often stated simply as H0: μ = 8.2) Alternative hypothesis: H1: μ > 8.2 Notes: H0: null hypothesis (the 0 should be in subscript) H1: alternative hypothesis (the 1 should be in subscript) μ: the population mean (pronounced "mew")
169
Type 1 Error
Also known as a false positive. Reject the null hypothesis when it’s actually true. This means that you report that your findings are significant when they have occurred by chance. The probability of making a type 1 error is represented by your alpha level (α), the p-value below which you reject the null hypothesis. To reduce your chance of making a Type I error, choose a lower significance level. (But this can increase your risk of making a type II error) **Type I errors are like false alarms**
170
Type 2 Error
Also known as a false negative. Fail to reject the null hypothesis when it’s actually false. The probability of making a type II error is called beta (β), which is related to the power of the statistical test (power = 1 − β). You can reduce your risk of making a Type II error by ensuring your test has enough power. In data work, power is usually set at 0.80, or 80%. The higher the statistical power, the lower the probability of making a Type II error. **Type II errors are like missed opportunities**
171
Does a statistically significant result prove that a research hypothesis is correct?
No. For a research hypothesis to be proven correct, we would need 100% certainty. Because a p-value is based on probabilities, there is always a chance of making an incorrect conclusion about rejecting or failing to reject the null hypothesis (H0).
172
Risks associated with a type 1 error
Resource Allocation: Making a Type I error can lead to wastage of resources. If a business believes a new strategy is effective when it’s not (based on a Type I error), they might allocate significant financial and human resources toward that ineffective strategy. Unnecessary Interventions: In medical trials, a Type I error might lead to the belief that a new treatment is effective when it isn’t. As a result, patients might undergo unnecessary treatments, risking potential side effects without any benefit. Reputation and Credibility: For researchers, making repeated Type I errors can harm their professional reputation. If they frequently claim groundbreaking results that are later refuted, their credibility in the scientific community might diminish. Notes: Type 1 Error: Also known as a false positive. Reject the null hypothesis when it’s actually true. This means that you report that your findings are significant when they have occurred by chance.
173
Risks associated with a type II error
Missed Opportunities: A Type II error can lead to missed opportunities for improvement or innovation. For example, in education, if a more effective teaching method is overlooked because of a Type II error, students might miss out on a better learning experience. Potential Risks: In healthcare, a Type II error might mean overlooking a harmful side effect of a medication because the research didn’t detect its harmful impacts. As a result, patients might continue using a harmful treatment. Stagnation: In the business world, making a Type II error can result in continued investment in outdated or less efficient methods. This can lead to stagnation and the inability to compete effectively in the marketplace. Notes: Type II error: Also known as a false negative. Fail to reject the null hypothesis when it’s actually false The probability of making a type II error is called Beta (β), which is related to the power of the statistical test (power = 1- β).
174
One-Sample Test
Determines whether or not a population parameter like a mean or proportion is equal to a specific value
175
Two-Sample Test
Determines whether or not two population parameters, such as two means or two proportions, are equal to each other. Example: A/B testing
176
One-Sample Z-Test Assumptions
-The data is a random sample of a normally distributed population -The population standard deviation is known Note: z-score: A measure of how many standard deviations below or above the population mean a data point is Also called standard scores as they're based on the standard normal distribution, which has a mean of zero and standard deviation of 1.
177
Example One-Sample Test
You work at a chain restaurant that conducted a study on their food delivery turnaround. Typically the mean delivery time is 40 minutes, but the mean delivery time of the sample is 38 minutes. Null hypothesis: mean delivery time = 40 (which is to say: the difference reported above is due to chance or sampling variability). p-value: the probability of observing a difference of 2 minutes or greater if the null hypothesis is true. If the p-value is less than 5% (the chosen significance level), reject the null hypothesis. *Why? A small p-value means the null hypothesis has become very hard to believe. If the p-value drops below that 5% threshold, you've decided the data is too unlikely under the null to keep accepting it, so you reject it. The z-score in this example is -2.82. On the standard normal distribution, -2.82 (the observed test statistic) is far to the left. The p-value is 0.0023, or 0.23%. 0.0023 < significance level, so you reject the null hypothesis. As such, it's unlikely that the 2-minute delivery difference is due to chance, which means we can reject the null hypothesis in favor of the alternative hypothesis. As such, we should study what the delivery drivers in this sample are doing to deliver faster and train the drivers in other areas to follow suit. NOTE: As a data professional, you’ll almost always calculate the p-value on your computer, using a programming language like Python or other statistical software.
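The delivery example above can be sketched in Python. Note that the card gives only the two means and the resulting z-score; the population standard deviation (σ = 5 minutes) and sample size (n = 50) below are hypothetical values chosen so the test statistic comes out near the card's z ≈ -2.8:

```python
import math
from scipy.stats import norm

# Values from the card
pop_mean = 40      # typical mean delivery time (minutes)
sample_mean = 38   # observed sample mean

# Hypothetical values (not stated in the card)
sigma = 5          # assumed population standard deviation
n = 50             # assumed sample size

# Test statistic: z = (x̄ − μ) / (σ / √n)
z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))

# Left-tailed p-value: P(Z <= z) under the standard normal distribution
p_value = norm.cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z ≈ -2.83, p ≈ 0.0023
```

Because 0.0023 is below the 5% significance level, you would reject the null hypothesis, matching the conclusion on the card.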
178
Test Statistic
A value that shows how closely your observed data matches the distribution expected under the null hypothesis. In other words: a test statistic indicates how closely your data match the null hypothesis. The following formula gives you a test statistic z based on your sample data: z = (x̄ − μ) / (σ / √n) z: z-score, a measure of how many standard deviations below or above the population mean a data point is x̄: (pronounced "x-bar") the sample mean μ: (pronounced "mew") the population mean σ: (pronounced "sigma") the population standard deviation n: the sample size Note: The p-value is calculated from the test statistic
179
How do you interpret p-value?
If p-value < significance level: reject the null hypothesis If p-value > significance level: fail to reject the null hypothesis (can't say "accept" as that would imply absolute certainty that the null hypothesis is correct, and you never have 100% certainty with statistics.)
180
Every hypothesis test features:
-A test statistic that indicates how closely your data match the null hypothesis. For a z-test, your test statistic is a z-score; for a t-test, it’s a t-score. -A corresponding p-value that tells you the probability of obtaining a result at least as extreme as the observed result if the null hypothesis is true.
181
How do you calculate p-value?
As a data professional, you’ll almost always calculate p-value on your computer, using a programming language like Python or other statistical software.
182
two main rules for drawing a conclusion about a hypothesis test:
-If your p-value is less than your significance level, you reject the null hypothesis. -If your p-value is greater than your significance level, you fail to reject the null hypothesis.
183
Data professionals and statisticians always say “fail to reject” rather than “accept.” Why?
This is because hypothesis tests are based on probability, not certainty—acceptance implies certainty. In general, data professionals avoid claiming certainty about results based on statistical methods.
184
null hypothesis significance testing or hypothesis testing
In quantitative research, data are analyzed through null hypothesis significance testing, or hypothesis testing. This is a formal procedure for assessing whether a relationship between variables or a difference between groups is statistically significant. Null and alternative hypotheses To begin, research predictions are rephrased into two main hypotheses: the null and alternative hypothesis. A null hypothesis (H0) always predicts no true effect, no relationship between variables, or no difference between groups. An alternative hypothesis (Ha or H1) states your main prediction of a true effect, a relationship between variables, or a difference between groups. Hypothesis testing always starts with the assumption that the null hypothesis is true. Using this procedure, you can assess the likelihood (probability) of obtaining your results under this assumption. Based on the outcome of the test, you can reject or retain the null hypothesis.
185
Problems with relying on statistical significance
-Researchers classify results as statistically significant or non-significant using a conventional threshold that lacks any theoretical or practical basis (5% significance level is most common; other common significance levels are 1% and 10%.). This means that even a tiny 0.001 decrease in a p value can convert a research finding from statistically non-significant to significant with almost no real change in the effect. -On its own, statistical significance may also be misleading because it’s affected by sample size. In extremely large samples, you’re more likely to obtain statistically significant results, even if the effect is actually small or negligible in the real world. This means that small effects are often exaggerated if they meet the significance threshold, while interesting results are ignored when they fall short of meeting the threshold. -The strong emphasis on statistical significance has led to a serious publication bias and replication crisis in the social sciences and medicine over the last few decades. Results are usually only published in academic journals if they show statistically significant results—but statistically significant results often can’t be reproduced in high quality replication studies. As a result, many scientists call for retiring statistical significance as a decision-making tool in favor of more nuanced approaches to interpreting results. That’s why APA guidelines advise reporting not only p values but also effect sizes and confidence intervals wherever possible to show the real world implications of a research outcome. Notes: Effect Sizes: Effect size tells you how meaningful the relationship between variables or the difference between groups is. It indicates the practical significance of a research outcome. Confidence Intervals: the range of values that you expect your estimate to fall between a certain percentage of the time if you run your experiment again or re-sample the population in the same way.
186
Effect Sizes
Effect size tells you how meaningful the relationship between variables or the difference between groups is. It indicates the practical significance of a research outcome. While statistical significance shows that an effect exists in a study, practical significance shows that the effect is large enough to be meaningful in the real world. Statistical significance is denoted by p-values, whereas practical significance is represented by effect sizes. Example: Statistical significance vs practical significance A large study compared two weight loss methods with 13,000 participants in a control intervention group and 13,000 participants in an experimental intervention group. The control group used scientifically backed methods for weight loss, while the experimental group used a new app-based method. After six months, the mean weight loss (kg) for the experimental intervention group (M = 10.6, SD = 6.7) was marginally higher than the mean weight loss for the control intervention group (M = 10.5, SD = 6.8). These results were statistically significant (p = .01). However, a difference of only 0.1 kilo between the groups is negligible and doesn’t really tell you that one method should be favored over the other. Adding a measure of practical significance would show how promising this new intervention is relative to existing interventions. How do you calculate effect size? There are dozens of measures for effect sizes, but the most common are Cohen's d and Pearson's r. Cohen's d: Cohen’s d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means. In general, the greater the Cohen’s d, the larger the effect size. Pearson's r: Pearson’s r, or the correlation coefficient, measures the extent of a linear relationship between two variables.
The formula is rather complex, so it’s best to use statistical software to calculate Pearson’s r accurately from the raw data. For Pearson’s r, the closer the value is to 0, the smaller the effect size.
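A minimal sketch of Cohen's d for two groups, using the pooled standard deviation (one common variant of the formula); the group data below are made up for illustration:

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference between two means in pooled-standard-deviation units."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Sample variances (dividing by n - 1)
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # Pooled standard deviation across both groups
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Made-up example data: weight loss (kg) in two small groups
treatment = [11.0, 10.7, 11.3, 10.9, 10.6]
control = [10.1, 9.8, 10.4, 10.0, 9.7]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

A common rule of thumb reads d ≈ 0.2 as a small effect, 0.5 as medium, and 0.8 or more as large.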
187
Two-Sample T-Test for Means, assumptions:
-The two samples are independent of each other -For each sample, the data is drawn randomly from a normally distributed population -The population standard deviation is unknown
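As a sketch, SciPy's `ttest_ind` runs a two-sample t-test directly; the sample data below are made up, and `equal_var=False` selects Welch's variant, which does not assume equal variances:

```python
from scipy import stats

# Made-up samples from two independent groups
group_a = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]
group_b = [11.2, 11.5, 11.0, 11.4, 11.1, 11.3]

# Two-sample t-test (Welch's version)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```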
188
When do data professionals use a z-test vs t-test?
Data professionals typically use -a z-test when the population standard deviation is known -a t-test when the population standard deviation is unknown and needs to be estimated from the data. Note: In practice, the population standard deviation is usually unknown, because it is difficult to get complete data on large populations
189
t-score
The test statistic for a t-test. t-scores are based on the t-distribution (as opposed to z-scores, which are based on the standard normal distribution). Notes: As the sample size increases, the t-distribution approaches the normal distribution. (Use a t-distribution when the sample is small, that is, n < 30.)
190
two-sample hypothesis test vs one-sample test
two-sample hypothesis test (or two-sample test): determines whether two population parameters, such as two means, are equal one-sample hypothesis test (or one-sample test): determines whether a population parameter is equal to a specific value.
191
Recall that p-value is the probability of observing results as or more extreme than those observed when the null hypothesis is true. In the context of hypothesis testing, “extreme” means
extreme in the direction(s) of the alternative hypothesis
192
one-tailed test
A one-tailed test results when the alternative hypothesis states that the actual value of a population parameter is either less than or greater than the value in the null hypothesis. A one-tailed test may be either left-tailed or right-tailed. A left-tailed test results when the alternative hypothesis states that the actual value of the parameter is less than the value in the null hypothesis. A right-tailed test results when the alternative hypothesis states that the actual value of the parameter is greater than the value in the null hypothesis. For example, imagine a test in which the null hypothesis states that the mean weight of a penguin population equals 30 lbs. In a left-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is less than (“<“) 30 lbs. In a right-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is greater than (“>”) 30 lbs.
193
two-tailed test
A two-tailed test results when the alternative hypothesis states that the actual value of the parameter does not equal the value in the null hypothesis. For example, imagine a test in which the null hypothesis states that the mean weight of a penguin population equals 30 lbs. In a two-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is not equal (“≠”) to 30 lbs. REMEMBER: p-value is the probability of observing results as or more extreme than those observed when the null hypothesis is true. In the context of hypothesis testing, “extreme” means extreme in the direction(s) of the alternative hypothesis (e.g., a z-score that is less than -1.75 or greater than 1.75).
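To illustrate the difference, here is a quick sketch computing one-tailed and two-tailed p-values from the same z-score of 1.75 mentioned above:

```python
from scipy.stats import norm

z = 1.75

# Right-tailed test: P(Z >= z), the area in the upper tail only
p_one_tailed = norm.sf(z)

# Two-tailed test: P(|Z| >= z), the area in both tails combined
p_two_tailed = 2 * norm.sf(abs(z))

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

For the same test statistic, the two-tailed p-value is always twice the one-tailed p-value, which is why a one-tailed test has more power to detect an effect in its chosen direction.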
194
H0 & Ha
H0: Null hypothesis Ha: Alternative hypothesis Note: The 0 and a should be subscripted
195
One-tailed versus two-tailed
You can use one-tailed and two-tailed tests to examine different effects. In general, a one-tailed test may provide more power to detect an effect in a single direction. However, before conducting a one-tailed test, you should consider the consequences of missing an effect in the other direction. For example, imagine a pharmaceutical company develops a new medication they believe is more effective than an existing medication. As a data professional analyzing the results of the clinical trial, you may wish to choose a one-tailed test to maximize your ability to detect the improvement. In doing so, you fail to test for the possibility that the new medication is less effective than the existing medication. And, of course, the company doesn’t want to release a less effective medication to the public. A one-tailed test may be appropriate if the negative consequences of missing an effect in the untested direction are minimal. For example, imagine that the company develops a new, less expensive medication that they believe is at least as effective as the existing medication. The lower price gives the new medication an advantage in the market. So, they just want to make sure the new medication is not less effective than the existing medication. Testing whether it’s more effective is not a priority. In this case, a one-tailed test may be appropriate.
196
Do you use a z-test or t-test to compare proportions?
For technical reasons, the best course is to use a z-test. Examples: Use a two-sample z-test to compare the proportion of -defects among manufactured products on two assembly lines -side effects to a new medicine for two trial groups -support for a new law among voters in two districts Example: A company has an office in Beijing and London. HR wants to determine if there is a different level of employee satisfaction at the two offices. -The team surveys a random selection of 50 employees at each office to see if they are satisfied with their current job. -Your goal: Determine if there is a statistically significant difference in the proportion of satisfied employees in London vs Beijing. If so, the HR team will deploy resources to investigate why employees at one office are more satisfied. -Results: London: 67% satisfied Beijing: 57% satisfied So, you conduct a two-sample z-test to analyze the data: 1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value (first find the test statistic, either the z-value or t-value, depending on the test type, then you can calculate the p-value) 4. Reject or fail to reject the null hypothesis 1. Null Hypothesis: There is no difference in the proportion of satisfied employees in London and Beijing. Alternative Hypothesis: There is a difference in the proportion of satisfied employees in London and Beijing. 2. Significance Level = 5% (company's standard for employee surveys) 3.
If p-value < 5%: reject the null hypothesis If p-value > 5%: fail to reject the null hypothesis (calculate the p-value on your computer using Python or a similar programming language) To find the p-value, you first need to find your test statistic z: z = (p̂₁ − p̂₂) / sqrt( p̂₀(1 − p̂₀)(1/n₁ + 1/n₂) ) p̂₁: sample proportion for the first group p̂₂: sample proportion for the second group n₁: sample size for the first group n₂: sample size for the second group p̂₀: the pooled proportion, a weighted average of the two sample proportions: p̂₀ = (p̂₁n₁ + p̂₂n₂) / (n₁ + n₂), which here gives 0.62. z = (0.67 − 0.57) / sqrt( 0.62 × (1 − 0.62) × (1/50 + 1/50) ) z ≈ 1.03 p-value (calculated using Python): 30.3% Remember: If p-value < significance level: reject the null hypothesis If p-value > significance level: fail to reject the null hypothesis 30.3% > 5% So, fail to reject the null hypothesis. There is not a statistically significant difference in the proportion of satisfied employees in the London vs Beijing office. In other words, the observed difference in proportions is likely due to chance. Value Add: Your results will likely save the HR team a great deal of time and money. The HR team will not have to dedicate money to investigating reasons for the difference in the two offices. Note: ^ is pronounced hat (e.g., p̂₁ is "p one hat"). The hat (^) over a letter means it's an estimate calculated from your sample data, as opposed to the true population value. p would mean the true proportion of the whole population, which you usually can't know. p̂ means the proportion you measured from your sample: your best guess at the true p.
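The Beijing/London example can be sketched as follows. The proportions and sample sizes come from the card; the pooled-proportion step is the standard formula:

```python
import math
from scipy.stats import norm

# Survey results from the card
p1, n1 = 0.67, 50   # London: proportion satisfied, sample size
p2, n2 = 0.57, 50   # Beijing: proportion satisfied, sample size

# Pooled proportion: weighted average of the two sample proportions
p0 = (p1 * n1 + p2 * n2) / (n1 + n2)

# Test statistic: difference in proportions over its standard error
z = (p1 - p2) / math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))

# Two-tailed p-value (the alternative hypothesis is "a difference in either direction")
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # z ≈ 1.03, p ≈ 0.303
```

Since 30.3% is well above the 5% significance level, you fail to reject the null hypothesis, matching the conclusion on the card.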
197
p̂₁ What does ^ mean?
^ is pronounced hat (e.g., p̂₁ is p one hat) The hat (^) over a letter means it's an estimate calculated from your sample data, as opposed to the true population value. p would mean the true proportion of the whole population — which you usually can't know p̂ means the proportion you measured from your sample — your best guess at the true p
198
Steps in a z-test or t-test
1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value (first find the test statistic, either the z-value or t-value, depending on the test type, then you can calculate the p-value) 4. Reject or fail to reject the null hypothesis
199
Common metrics analyzed in A/B tests
Average revenue per user: How much revenue does a user generate for a website? Average session duration: How long does a user remain on a website? Click rate: If a user is shown an ad, does the user click on it? Conversion rate: If a user is shown an ad, will that user convert into a customer?
200
Average revenue per user
How much revenue does a user generate for a website?
201
Average session duration
How long does a user remain on a website?
202
Click rate
Click rate: If a user is shown an ad, does the user click on it?
203
Conversion rate
Conversion rate: If a user is shown an ad, will that user convert into a customer?
204
3 Main features of a typical A/B test
1. Test design (e.g., A/B test) 2. Sampling (e.g., random selection) 3. Hypothesis testing (e.g., z-test or t-test)
205
randomized controlled experiment
In a randomized controlled experiment, test subjects are randomly assigned to a control group and a treatment group. (An A/B test is a basic version of what’s known as a randomized controlled experiment. ) The treatment is the new change being tested in the experiment. The control group is not exposed to the treatment. The treatment group is exposed to the treatment.
206
Experimental design
Experimental design refers to planning an experiment in order to collect data to answer your research question. For example, a data professional might design an experiment to discover whether: A new medicine leads to faster recovery time A new website design increases product sales A new fertilizer increases crop growth A new training program improves athletic performance
207
3 Key steps in designing an experiment
1. Define your variables (i.e., the independent and dependent variables) --independent: what you're interested in investigating (e.g., the medicine) --dependent: the effect you're interested in measuring (e.g., the recovery time) 2. Formulate your hypothesis 3. Assign test subjects to treatment and control groups
208
3 Key steps in designing an experiment: 1. Define your variables
Define the independent and dependent variables in the experiment -The independent variable refers to the cause you’re interested in investigating. A researcher changes or controls the independent variable to determine how it affects the dependent variable. “Independent” means it’s not influenced by other variables in the experiment. -The dependent variable refers to the effect you’re interested in measuring. “Dependent” means its value is influenced by the independent variable. Example: In your clinical trial, you want to find out how the medicine affects recovery time. Therefore: Your independent variable is the medicine—the cause you want to investigate. Your dependent variable is recovery time—the effect you want to measure.
209
3 Key steps in designing an experiment: 2. Formulate your hypothesis
Formulate the null & alternative hypothesis. Example: Your null hypothesis (H0) is that the medicine has no effect. Your alternative hypothesis (Ha) is that the medicine is effective.
210
3 Key steps in designing an experiment: 3. Assign test subjects to treatment and control groups:
Experiments such as clinical trials and A/B tests are controlled experiments. In a controlled experiment, test subjects are assigned to a treatment group and a control group. The treatment is the new change being tested in the experiment. The treatment group is exposed to the treatment. The control group is not exposed to the treatment. The difference in metric values between the two groups measures the treatment’s effect on the test subjects.
211
controlled experiment
In a controlled experiment, test subjects are assigned to a treatment group and a control group. The treatment is the new change being tested in the experiment. The treatment group is exposed to the treatment. The control group is not exposed to the treatment. The difference in metric values between the two groups measures the treatment’s effect on the test subjects.
212
nuisance factors
factors that can affect the result of an experiment, but are not of primary interest to the researcher.
213
Blocking
Blocking: arranging test subjects in groups, or blocks, that are similar to one another.
214
Regression Analysis or Regression Models
A group of statistical techniques that use existing data to estimate the relationships between a single dependent variable and one or more independent variables
215
PACE for regression analysis
Plan, Analyze, Construct, & Execute Plan: Understand your data in the problem context. Contextualize and understand the data and the problem. Analyze: EDA (exploratory data analysis), check model assumptions, & select model. Determine if we should move forward with building the model. Construct: Construct and evaluate the model. Determine how well your model fits the data. Execute: Interpret the model and share the story. Descriptions must take into account the context of the data.
216
Model Assumptions
Statements about the data that must be true to justify the use of particular data science techniques
217
Linear Regression
A technique that estimates the linear relationship between a continuous dependent variable and one or more independent variables. Example: the relationship between the price of a product (x value) and the number of sales (y value)
218
Dependent Variable (Y)
The variable a given model estimates, which is also referred to as a response or outcome variable.
219
Independent Variable (x)
A variable that explains trends in the dependent variable, which is also referred to as an explanatory or predictor variable.
220