Stats Flashcards

(222 cards)

1
Q

Statistics

A

The study of the collection, analysis, and interpretation of data

2
Q

Econometrics

A

A branch of economics that uses statistics to analyze economic problems

3
Q

A/B Testing

A

A way to compare two versions of something to find out which version performs better

4
Q

Sample

A

Subset of a larger population

5
Q

Inferential Statistics

A

Allow data professionals to make inferences about a dataset based on a sample of the data (i.e., use existing data to predict outcomes, e.g., how the next 99k users will behave based on how the first 1k users behaved.)

6
Q

A/B testing cannot predict outcomes with 100% certainty, so results are reported with a ___

A

confidence interval

7
Q

Confidence Interval

A

A range of values that describes the uncertainty surrounding an estimate

8
Q

Statistical Significance

A

The claim that the results of a test or experiment are not explainable by chance alone

9
Q

A/B Testing Steps

A

-Analyze a small group of users
-Decide on the sample size
-Determine the statistical significance

10
Q

Descriptive Statistics

A

Describe or summarize the main features of a dataset
Useful because they let you understand a large amount of data quickly.

Example: You have the heights of 10M people.
If you summarize the data (i.e., find the mean or median height), you have useful knowledge about the data. Better than staring at 10M rows of data.

11
Q

2 Common Types of Descriptive Statistics

A

-Visuals like graphs and tables
-Summary stats: let you summarize your data using a single number (e.g., the mean or average value)

12
Q

2 Main Types of Summary Stats

A

1) Measures of Central Tendency: Describe the center of your dataset (e.g., the mean)

2) Measures of Dispersion: Describe the spread of your dataset, or the amount of variation in your data points (e.g., standard deviation: a measure of how dispersed the data is in relation to the mean).

13
Q

Statistical Population

A

Every possible element that you are interested in measuring

A statistical population may refer to people, objects, or events.

For example:
Set of all residents in a country
Set of all planets in our solar system
Set of all the outcomes of 1k coin flips

So samples could be residents, planets, or coin flip outcomes

14
Q

Data professionals use samples to __

A

Make inferences about a population

That is, they use the data they collect from a subset of a population to draw conclusions about the population as a whole.

15
Q

Representative Sample

A

A sample that accurately reflects the population

16
Q

Parameter

A

A characteristic of a population

Example: The average height of an entire population of giraffes is a parameter

17
Q

Statistic

A

A characteristic of a sample

Example: The average height of a random sample of 100 giraffes is a statistic

18
Q

Parameter vs Statistic

A

Parameter: a characteristic of a population
Example: The average height of an entire population of giraffes is a parameter

Statistic: A characteristic of a sample
Example: The average height of a random sample of 100 giraffes is a statistic

19
Q

Measures of Central Tendency

A

Mean: the average value

*Outliers can skew the mean (e.g., if you have values like 5, 6, 7, 8 and then an outlier like 100, the outlier can throw the mean off and make it substantially different from the median, in this case 7).

Median: the middle value
Note: if there are an even number of values in your dataset, the median is the avg of the two middle values.
Example: 3, 5, 8, 10, 12, 50. The two middle values: 8 and 10. To get the median, take their average: (8 + 10) / 2 = 18 / 2 = 9. The median is 9.

Mode: The most frequently occurring value in a dataset.
A dataset could have no mode, one mode, or more than one mode.
Examples:
No mode: 1, 2, 3, 4, 5
One mode: 1, 3, 3, 5, 7
Two modes: 1, 2, 2, 4, 4

*Mode is useful for categorical data, because it shows you which category occurs most frequently
Example: Customers rate service bad, mediocre, good, or great.
bad is the most frequently occurring value or mode, indicating that improvements in service are needed.
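As a quick sketch (not part of the card itself), Python's standard-library statistics module can compute all three measures; the example dataset is the "one mode" list from above:

```python
import statistics

data = [1, 3, 3, 5, 7]

mean = statistics.mean(data)      # (1 + 3 + 3 + 5 + 7) / 5 = 3.8
median = statistics.median(data)  # middle value of the sorted data: 3
mode = statistics.mode(data)      # most frequently occurring value: 3

print(mean, median, mode)
```

Note that statistics.mode raises an error on some Python versions if there is a tie; statistics.multimode handles datasets with more than one mode.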

20
Q

When to use a mean vs median

A

If there are outliers: use the median

If there are no outliers: use the mean

Example: You look at 10 homes in a neighborhood. 9 of the 10 cost $100,000 and 1 costs $1M.
The mean: 190k
The median: 100k

In this instance, the mean does not give you a good idea of the average cost of a home in this neighborhood, because only 1/10 of the homes for sale are more than 100k.
The median would be a more representative value for the average cost of a home for sale in this neighborhood.
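A quick way to see this effect in Python (a sketch; the variable names are my own):

```python
import statistics

# nine $100k homes and one $1M outlier
prices = [100_000] * 9 + [1_000_000]

mean_price = statistics.mean(prices)      # 190000: dragged up by the outlier
median_price = statistics.median(prices)  # 100000: more representative of a typical home

print(mean_price, median_price)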

21
Q

When should you use mode over median or mean?

A

When working with categorical data, because it shows you which category occurs most frequently.

Example: Customers rate service bad, mediocre, good, or great.
bad is the most frequently occurring value or mode, indicating that improvements in service are needed.

22
Q

What to look for in a new dataset

A

-Measures of central tendency (center): mean, median, mode
-Measures of dispersion (spread): standard deviation, range

Example:
The following sets have similar central tendencies (the means are about 26.7, 30, and 30, respectively) BUT the measures of dispersion/spread are markedly different.

Set 1: 25, 30, 25
Set 2: 10, 25, 55
Set 3: 5, 10, 75

23
Q

Range

A

A measure of dispersion.

The difference between the largest and smallest value in a dataset

The range is a useful metric because it’s easy to calculate, and it gives you a very quick understanding of the overall spread of your dataset.

Example 1: Daily temperatures in a small town over the past week: 77, 74, 72, 71, 67, 69, 72

The highest temp: 77
The lowest temp: 67

Range: 77-67=10

Example 2: Imagine you’re a biology teacher and you have data on scores for the final exam. The highest score is 99/100, or 99%. The lowest score is 62/100, or 62%. To calculate the range, subtract the lowest score from the highest score.

99 - 62 = 37

The range is 37 percentage points.
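Both range examples can be checked in a couple of lines of Python (a sketch; the variable names are my own):

```python
temps = [77, 74, 72, 71, 67, 69, 72]   # daily temperatures over the past week
temp_range = max(temps) - min(temps)   # 77 - 67 = 10

scores = [99, 62]                      # highest and lowest exam scores
score_range = max(scores) - min(scores)  # 99 - 62 = 37

print(temp_range, score_range)
```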

24
Q

Standard Deviation

A

A measure of dispersion

Measures how spread out your values are from the mean of your dataset.
It calculates the typical distance of a data point from the mean.

25
Q

Variance

A

The average of the squared differences of each data point from the mean. (It is the square of the standard deviation.)
26
Q

How to calculate standard deviation

A

The formula for sample standard deviation:

s = √( Σ(x − x̄)² / (n − 1) )

(For a population, divide by N instead of n − 1: σ = √( Σ(x − μ)² / N ).)

Steps:
Step 1: Calculate the mean of the data (x̄).
Step 2: Subtract the mean from each data point x. These differences are called deviations. Data points below the mean have negative deviations, and data points above the mean have positive deviations.
Step 3: Square each deviation to make it positive.
Step 4: Add the squared deviations together.
Step 5: Divide the sum by n − 1 for a sample (or by N for a population). The result is called the variance.
Step 6: Take the square root of the variance to get the standard deviation.

Example 1: To better understand the different parts of the formula, calculate the sample standard deviation of a small dataset: 2, 3, 10.

1. Calculate the mean, or average, of your data values: (2 + 3 + 10) ÷ 3 = 15 ÷ 3 = 5
2. Subtract the mean from each value: 2 − 5 = −3; 3 − 5 = −2; 10 − 5 = 5
3. Square each result: 9, 4, 25
4. Add up the squared results and divide this sum by one less than the number of data values. This is the variance: (9 + 4 + 25) ÷ (3 − 1) = 38 ÷ 2 = 19
5. Finally, find the square root of the variance: √19 ≈ 4.36

The sample standard deviation is 4.36.

Example 2: Meteorologists use standard deviation in weather forecasting to understand how much variation exists in daily temperatures in different places and to make more accurate predictions.

City A: mean temp 66 degrees, standard deviation 3 degrees
City B: mean temp 64 degrees, standard deviation 16 degrees

Because the standard deviation is higher in City B, there is more variation in daily temperature there than in City A, where the weather is more consistent. If the meteorologist relied only on the mean in City B, the forecast could be off by as much as 16 degrees, which would, understandably, result in many grumpy residents.

Knowing the standard deviation gives the meteorologist in City B a useful measure of variance to consider and a level of confidence about the prediction. Regardless, a higher standard deviation does make the weather harder to predict accurately.
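The five steps for the dataset 2, 3, 10 can be verified in Python; this sketch mirrors the card's arithmetic step by step:

```python
import math

data = [2, 3, 10]
n = len(data)

mean = sum(data) / n                   # step 1: (2 + 3 + 10) / 3 = 5.0
deviations = [x - mean for x in data]  # step 2: [-3.0, -2.0, 5.0]
squared = [d ** 2 for d in deviations] # step 3: [9.0, 4.0, 25.0]
variance = sum(squared) / (n - 1)      # step 4: sample variance = 19.0
std = math.sqrt(variance)              # step 5: sqrt(19) ≈ 4.36

print(round(std, 2))
```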
27
Q

What are a few examples of data professionals using standard deviation to measure variation?

A

-Ad revenues
-Stock prices
-Employee salaries
-Weather forecasts

Weather example: Meteorologists use standard deviation in weather forecasting to understand how much variation exists in daily temperatures in different places and to make more accurate predictions.

City A: mean temp 66 degrees, standard deviation 3 degrees
City B: mean temp 64 degrees, standard deviation 16 degrees

Because the standard deviation is higher in City B, there is more variation in daily temperature there than in City A, where the weather is more consistent. If the meteorologist relied only on the mean in City B, the forecast could be off by as much as 16 degrees, which would, understandably, result in many grumpy residents.

Knowing the standard deviation gives the meteorologist in City B a useful measure of variance to consider and a level of confidence about the prediction. Regardless, a higher standard deviation does make the weather harder to predict accurately.
28
Q

Example of when knowing standard deviation is helpful

A

Real estate prices. Imagine you're a data professional working for a real estate company. The real estate agents on your team like to inform their clients about the variation in rental prices in different residential areas. Part of your job is calculating the standard deviation of monthly rental prices for apartments in specific neighborhoods and sharing this information with your team.

Say you have sample data on monthly rental prices for one-bedroom apartments in two different neighborhoods, Emerald Woods and Rock Park, and you calculate the mean and standard deviation for each dataset.

Emerald Woods monthly rents (apartments #1–#5): $900, $950, $1,000, $1,050, $1,100
Mean: $1,000. Standard deviation: $79.05

Rock Park monthly rents (apartments #1–#5): $500, $650, $1,000, $1,350, $1,500
Mean: $1,000. Standard deviation: $431.56

Both neighborhoods have the same mean rental price of $1,000 per month. However, the standard deviation for rental prices in Rock Park ($431.56) is much higher than for Emerald Woods ($79.05): there is a lot more variation in rental prices in Rock Park.

This is useful information for your agents. For example, they can tell clients that it may be easier to find an affordable apartment in Rock Park that is far below the mean of $1,000. Standard deviation helps you quickly understand the variation in prices in any given neighborhood.
29
Q

Measures of Position / Most Common Measures of Position

A

Measures of position: determine the position of a value in relation to other values in a dataset.

Most common measures of position: percentiles, quartiles, interquartile range, five-number summary.
30
Q

Percentile

A

The value below which a percentage of data falls. Percentiles show the relative position or rank of a particular value in a dataset. A percentile is a measure of position.

Example: Many universities require students to take standardized tests (e.g., the SAT and ACT in the US). When students receive their test score, they usually also receive a corresponding percentile. If a test score falls in the 99th percentile, it's higher than 99% of test scores. If it falls in the 77th percentile, it's higher than 77% of test scores.

Percentiles are useful for comparing values.
31
Q

Quartile

A

Divides the values in a dataset into four equal parts. Each quarter contains 25% of the data in your dataset.

Q1 (25th percentile, lower quartile): 25% of the data is below Q1; 75% is above it.
Q2 (50th percentile): Q2 is the median. 50% of the data is below Q2; 50% is above it.
Q3 (75th percentile, upper quartile): 75% of the data is below Q3; 25% is above it.
32
Q

Example of how to calculate quartiles for a set

A

Example: goals scored by eight players (#7, #3, #8, #1, #2, #6, #4, #5), sorted: 11, 12, 14, 18, 22, 23, 27, 33

1. Find the median of your full dataset. There is an even number of values, so find the middle two values and take their mean: (18 + 22) / 2 = 20. Q2 = 20.
2. Find the median of the lower half of your dataset: 11, 12, 14, 18. Again, there is an even number of values, so take the mean of the middle two: (12 + 14) / 2 = 26 / 2 = 13. Q1 = 13.
3. Find the median of the upper half of your dataset: 22, 23, 27, 33. Take the mean of the middle two: (23 + 27) / 2 = 50 / 2 = 25. Q3 = 25.

This gives you a clear idea of player performance:
Q1 = 13 (the lower 25% of players scored 13 goals or fewer)
Q2 = 20
Q3 = 25 (the upper 25% scored 25 goals or more)
33
Q

Interquartile range (IQR)

A

The distance between the first quartile (Q1) and the third quartile (Q3): the middle 50% of your data, or the distance between the 25th and 75th percentiles. Technically, the IQR is a measure of dispersion, because it measures the spread of the middle 50% of your data.

IQR = Q3 − Q1

Example: goals scored by eight players, sorted: 11, 12, 14, 18, 22, 23, 27, 33

Q2 = 20 (median of the full dataset: (18 + 22) / 2)
Q1 = 13 (median of the lower half, 11, 12, 14, 18: (12 + 14) / 2)
Q3 = 25 (median of the upper half, 22, 23, 27, 33: (23 + 27) / 2)

IQR = 25 − 13 = 12
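The median-of-halves method from this card can be sketched in Python; note that NumPy's np.percentile uses linear interpolation by default, so it can return slightly different quartiles than this method:

```python
import statistics

goals = sorted([11, 12, 14, 18, 22, 23, 27, 33])
mid = len(goals) // 2

q1 = statistics.median(goals[:mid])  # median of the lower half: 13
q2 = statistics.median(goals)        # median of the full dataset: 20
q3 = statistics.median(goals[mid:])  # median of the upper half: 25
iqr = q3 - q1                        # 25 - 13 = 12

print(q1, q2, q3, iqr)
```

This even-count split matches the card's example exactly; an odd-count dataset needs a convention for whether the overall median joins the halves.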
34
Q

Five-Number Summary

A

The minimum, the first quartile (Q1), the median (second quartile, Q2), the third quartile (Q3), and the maximum.

Useful because it gives you an idea of the overall distribution of your data, from the extreme values to the center.

Example: goals scored by eight players, sorted: 11, 12, 14, 18, 22, 23, 27, 33

The minimum = 11
The first quartile (Q1) = 13 (median of the lower half: (12 + 14) / 2)
The median (Q2) = 20 (median of the full dataset: (18 + 22) / 2)
The third quartile (Q3) = 25 (median of the upper half: (23 + 27) / 2)
The maximum = 33

(IQR = 25 − 13 = 12)
35
Q

Box Plot

A

A visualization of the five-number summary: the box spans Q1 to Q3 (the IQR) with a line at the median (Q2), and whiskers extend from the box out to the minimum and maximum.

  Min |---------[ Q1 ----- Median (Q2) ----- Q3 ]---------| Max
      (whisker)  |<-------------- IQR -------------->|  (whisker)

Example: Q1 = 13, Q2/median = 20, Q3 = 25, IQR = 25 − 13 = 12
36
Q

Measures of position can be used by data professionals to better understand a number of things. Provide a few examples.

A

Measures of position, which determine the position of a value in relation to other values in a dataset, can be used to better understand:
-public health data, such as life expectancy
-macroeconomic data, such as household income
-business data, such as product sales

*REMINDER: the most common measures of position are percentiles, quartiles, the interquartile range, and the five-number summary.
37
Q

Percentile vs Percentage

A

Percentiles and percentages are distinct concepts. For example, say you score 90/100, or 90%, on a test. This doesn’t necessarily mean your score of 90% is in the 90th percentile. Percentile depends on the relative performance of all test takers: if half of all test takers score above 90%, then a score of 90% is in the 50th percentile.

A percentile is the value below which a percentage of data falls. Percentiles divide your data into 100 equal parts and give the relative position or rank of a particular value in a dataset. For example, percentiles are commonly used to rank test scores on school exams. If a score falls in the 99th percentile, it is higher than 99% of all test scores. If a score falls in the 75th percentile, it is higher than 75% of all test scores. If a score falls in the 50th percentile, it is higher than half, or 50%, of all test scores.

A percentile is a measure of position.
38
Q

Use case for percentiles?

A

Percentiles are useful for comparing values and putting data in context.

For example, imagine you want to buy a new car: a midsize sedan with great fuel economy. In the United States, fuel economy is measured in miles per gallon of fuel, or mpg. The sedan you’re considering gets 23 mpg. Is that good or bad? Without a basis for comparison, it’s hard to know. However, if you know that 23 mpg is in the 25th percentile of all midsize sedans, you have a much clearer idea of its relative performance: 75% of all midsize sedans have a higher mpg than the car you’re thinking about buying.
39
Q

How to determine percentile using Python (NumPy)

A

Example 1: Find the 40th percentile of a data array.

import numpy as np
data = np.array([10, 20, 30, 40, 50])
np.percentile(data, 40)

Example 2: Find the 25th, 50th, and 75th percentiles of the data array.

np.percentile(data, [25, 50, 75])

Note on the axis parameter:
axis=0: percentiles down columns
axis=1: percentiles across rows

So for

arr = np.array([[10, 20, 30],
                [40, 50, 60]])
np.percentile(arr, 50, axis=0)

the result gives you the median (50th percentile) of each column: [25., 35., 45.]
40
Q

Measures of Dispersion

A

Range: the difference between the largest and smallest value in a dataset.
Variance: the average of the squared differences of each data point from the mean.
Standard deviation: the typical distance of a data point from the mean of your dataset (the square root of the variance).
Interquartile range (IQR): the difference between the third quartile (Q3) and the first quartile (Q1); it indicates the spread of the middle half, or middle 50%, of your data.
41
Q

Measures of Position

A

Percentile: the value below which a percentage of data falls; divides the values in a dataset into 100 equal parts.
Quartile: divides the values in a dataset into 4 equal parts.
42
Q

Descriptive Stats

A

Measures of central tendency: describe the center of the dataset.
Measures of dispersion: describe the spread of your dataset.
Measures of position: show the relative location of your data values.
43
Q

EDA Steps

A

1) Discovering: the goal is to understand the context of the data. This is often done by discussing the data with project stakeholders and by reading documentation about the dataset and the data collection process.
2) Structuring
3) Cleaning: deal with issues like missing data and incorrect values. A common step after cleaning: compute descriptive stats to summarize the dataset.
4) Joining
5) Validating
6) Presenting
44
Q

Multiplication Rule of Probability

A

First, determine whether the events are independent or dependent.

Independent events: one event's outcome has no effect on the other event's outcome.
Dependent events: the second event's probability depends on the outcome of the first event.

Independent example 1: What's the probability of rolling snake eyes (both dice showing 1)? Each roll is an independent event, because neither roll affects the other.

P(A and B) = P(A) × P(B)

The probability of A and B occurring together (i.e., both dice showing 1) = the probability of event A × the probability of event B:
1/6 × 1/6 = 1/36

So, out of all 36 possible rolls of two dice, 1 is snake eyes.

Independent example 2: What's the probability of flipping heads 3 times in a row?
1/2 × 1/2 × 1/2 = 1/8
So, out of the 8 possible outcomes for three coin flips, 1 is all heads.

Dependent example (the second event's probability depends on the outcome of the first):

P(A and B) = P(A) × P(B|A)

The probability that events A and B both happen = the probability that A happens × the probability that B happens given that A happened.

What's the probability that you'll draw an Ace, hold on to it, and then draw a King?
Ace: 4/52 (4 of the 52 cards are Aces)
King: 4/51 (51 cards remain after removing the Ace, and 4 of them are Kings)
4/52 × 4/51 = 16/2652 ≈ 1/166 (about a one in 166 chance)
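The card's calculations can be reproduced exactly with Python's fractions module (a sketch; the variable names are my own):

```python
from fractions import Fraction

# Independent events: snake eyes = P(1 on die A) * P(1 on die B)
p_snake_eyes = Fraction(1, 6) * Fraction(1, 6)       # 1/36

# Independent events: three heads in a row
p_three_heads = Fraction(1, 2) ** 3                  # 1/8

# Dependent events: draw an Ace, keep it, then draw a King
p_ace_then_king = Fraction(4, 52) * Fraction(4, 51)  # 16/2652 = 4/663

print(p_snake_eyes, p_three_heads, p_ace_then_king)
```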
45
Q

Addition Rule of Probability

A

Use the addition rule of probability for situations like the following:
-Example 1: What's the chance of drawing a Heart OR a Face card?
-Example 2: If some students play soccer, some play tennis, and some play both, what's the probability that a randomly selected student plays soccer, tennis, or both?

Mutually exclusive events: the events cannot both occur (e.g., you roll a 2 or a 5 in a single dice roll).

P(A or B) = P(A) + P(B)

The probability of A or B occurring is the probability of A occurring + the probability of B occurring.

Mutually exclusive example: What's the probability of rolling a 2 or a 5?
Probability of rolling a 2: 1/6. Probability of rolling a 5: 1/6.
1/6 + 1/6 = 2/6

Example 1 of events that are NOT mutually exclusive: What's the likelihood of drawing a Heart or a Face card? (The Queen of Hearts is a Heart AND a Face card.)

P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A and B)

The probability of A or B = the probability of A + the probability of B − the probability of A and B occurring together.

Hearts: 13/52 (13 Hearts in a deck of cards)
Face cards: 12/52 (12 Face cards)
Cards that are both: 3/52 (we must subtract these 3, as otherwise we'd count the Jack, Queen, and King of Hearts twice)

13/52 + 12/52 − 3/52 = 22/52

There's a 22/52 probability that you'll draw a Heart or a Face card.

Example 2 of events that are NOT mutually exclusive: 50% of the student body plays soccer, 20% plays tennis, and 10% plays both. What's the probability that a random student plays soccer, tennis, or both?
50% + 20% − 10% = 60% probability that the selected student plays soccer, tennis, or both.
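Likewise, the addition-rule examples can be checked with the fractions module (a sketch; the variable names are my own):

```python
from fractions import Fraction

# Mutually exclusive: roll a 2 OR a 5
p_2_or_5 = Fraction(1, 6) + Fraction(1, 6)  # 2/6 = 1/3

# Not mutually exclusive: Heart OR Face card (subtract the overlap)
p_heart_or_face = Fraction(13, 52) + Fraction(12, 52) - Fraction(3, 52)  # 22/52 = 11/26

# Not mutually exclusive: soccer, tennis, or both
p_soccer_or_tennis = 0.50 + 0.20 - 0.10  # 60%

print(p_2_or_5, p_heart_or_face, p_soccer_or_tennis)
```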
46
Q

What Python packages do you want to load for stats?

A

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
47
Q

How do you load a CSV file?

A

Example 1:

import pandas as pd
education_districtwise = pd.read_csv(...)

Example 2:

import pandas as pd
epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col=0)

Note: the index_col parameter can be set to 0 to read in the first column as an index (and to avoid "Unnamed: 0" appearing as a column in the resulting DataFrame).
48
Q

What does describe() give you?

A

Data professionals use the describe() function as a convenient way to calculate many key stats all at once.

For a numeric column, describe() gives you the following output:
count: number of non-NA/null observations
mean: the arithmetic average
std: the standard deviation
min: the smallest (minimum) value
25%: the first quartile (25th percentile)
50%: the median (50th percentile)
75%: the third quartile (75th percentile)
max: the largest (maximum) value

Note: describe() excludes missing values (NaN) in the dataset from consideration. You may notice that the count, or number of observations, for OVERALL_LI (634) is fewer than the number of rows in the dataset (680). Dealing with missing values is a complex issue outside the scope of this course.

You can also use the describe() function for a column with categorical data, like the STATNAME column. For a categorical column, describe() gives you the following output:
count: number of non-NA/null observations
unique: number of unique values
top: the most common value (the mode)
freq: the frequency of the most common value

Example:

education_districtwise['STATNAME'].describe()

Output:
count        680
unique        36
top      STATE21
freq          75
Name: STATNAME, dtype: object
49
Q

How do you calculate standard deviation for a column?

A

Example (epa_data is the DataFrame, aqi is the column):

np.std(epa_data['aqi'], ddof=1)

Note: ddof stands for delta degrees of freedom. When NumPy computes standard deviation, it uses this formula:

std = √( Σ(xᵢ − x̄)² / (N − ddof) )

N = number of observations
x̄ = mean
ddof = how much you subtract from N in the denominator

Why do you need to subtract 1? When you estimate the mean from the same data you're measuring the variability of, you "use up" one degree of freedom. That makes the raw variance too small unless you correct for it.

When to use ddof=1:
-Your data is a sample.
-You're doing inference, comparison, or modeling as opposed to pure descriptive statistics. (That is, you're asking "what does this dataset tell us about something bigger?" and not "what does this exact dataset look like?" The latter is pure descriptive statistics.)
-You're following stats conventions (e.g., coursework, reports, analysis).
-You want results comparable to pandas, R, or textbooks.

When to use ddof=0:
-Your data represents the entire population.
-You're doing pure descriptive statistics.
-You care about the exact dispersion of this dataset, not a generalization (i.e., you are looking at this specific dataset and not trying to generalize to next year, other cities, etc.).
-You're matching NumPy defaults or ML preprocessing pipelines.

NOTE: pandas defaults to ddof=1; NumPy defaults to ddof=0. So np.std(data) and data.std() are NOT equivalent, while np.std(data, ddof=1) and data.std() are equivalent.
50
Q

What is ddof?

A

ddof: delta degrees of freedom. When NumPy computes standard deviation, it uses this formula:

std = √( Σ(xᵢ − x̄)² / (N − ddof) )

N = number of observations
x̄ = mean
ddof = how much you subtract from N in the denominator

Why do you need to subtract 1? When you estimate the mean from the same data you're measuring the variability of, you "use up" one degree of freedom. That makes the raw variance too small unless you correct for it.

When to use ddof=1:
-Your data is a sample.
-You're doing inference, comparison, or modeling as opposed to pure descriptive statistics. (That is, you're asking "what does this dataset tell us about something bigger?" and not "what does this exact dataset look like?" The latter is pure descriptive statistics.)
-You're following stats conventions (e.g., coursework, reports, analysis).
-You want results comparable to pandas, R, or textbooks.

When to use ddof=0:
-Your data represents the entire population.
-You're doing pure descriptive statistics.
-You care about the exact dispersion of this dataset, not a generalization (i.e., you are looking at this specific dataset and not trying to generalize to next year, other cities, etc.).
-You're matching NumPy defaults or ML preprocessing pipelines.

NOTE: pandas defaults to ddof=1; NumPy defaults to ddof=0. So np.std(data) and data.std() are NOT equivalent, while np.std(data, ddof=1) and data.std() are equivalent.
51
Q

Probability

A

The branch of math that deals with measuring and quantifying uncertainty (i.e., the chance of something happening).
52
Q

Objective probability

A

Based on stats, experiments, and mathematical measurements.

2 types:
Classical probability: based on formal reasoning about events with equally likely outcomes.
Empirical probability: based on experimental or historical data.
53
Q

Subjective probability

A

Based on personal feelings, experience, or judgment.
54
Q

Classical Probability

A

One of the two types of objective probability (the other is empirical probability).

Classical probability is based on formal reasoning about events with equally likely outcomes.

Calculation: classical probability = number of desired outcomes / total number of possible outcomes

Example: drawing a specific card, like the Ace of Hearts, from a deck has a 1/52 chance.

Note: most events do not have equally likely outcomes (e.g., tomorrow's weather is not simply a 50% chance of rain; it may be an 80% chance of rain), so we need empirical probability.
55
Q

Empirical Probability

A

One of the two types of objective probability (the other is classical probability).

Empirical probability is based on experimental or historical data. It represents the likelihood of an event occurring based on the previous results of an experiment or past events.

Calculation: empirical probability = number of times a specific event occurs / total number of events

Example: in a taste test with 100 people, you want to know the probability that a person prefers vanilla over strawberry. 80 of the 100 people prefer vanilla, so the probability that a person prefers vanilla over strawberry is 80%.
56
Q

Events in probability

A

Probability measures the likelihood of RANDOM events. The result of a random event cannot be predicted with certainty.

If the probability of an event equals 0, there is a 0% chance that the event will occur.
If the probability of an event equals 1, there is a 100% chance that the event will occur.
A probability of 0.5 means there is a 50% chance of the event occurring, and so on.

Rule of thumb:
If the probability of an event is close to 0, there is a small chance that the event will occur.
If the probability is close to 1, there is a strong chance that the event will occur.

Example: you wouldn't want to buy a stock that has a 0.05 probability of going up, but if the probability were 0.95, it would likely be a good investment. (0.05 = 5%; 0.95 = 95%.)
57
Q

Random Experiment or Statistical Experiment

A

A process whose outcome cannot be predicted with certainty.

All random/statistical experiments have 3 things in common:
1) The experiment can have more than one possible outcome.
2) You can represent each possible outcome in advance.
3) The outcome of the experiment depends on chance.
58
Classical Probability
Number of desired outcomes/total number of possible outcomes
59
Outcome
In stats, the results of a random experiment/statistical experiment is called an outcome. Example: If you roll a die, there are six possible outcomes: 1, 2, 3, 4, 5, 6
60
Event
In stats, an event is a set of one or more outcomes. Example: If you roll a die, an event might be rolling an even number. The event of rolling an even number consists of the outcomes 2, 4, 6. The event of rolling an odd number consists of the outcomes 1, 3, 5.
61
Probability of an event
The probability that an event will occur is expressed as a number between 0 and 1. Probability can also be expressed as a percent. If the probability of an event equals 0, there is a 0% chance that the event will occur. If the probability of an event equals 1, there is a 100% chance that the event will occur. There are different degrees of probability between 0 and 1. If the probability of an event is close to zero, say 0.05 or 5%, there is a small chance that the event will occur. If the probability of an event is close to 1, say 0.95 or 95%, there is a strong chance that the event will occur. If the probability of an event equals 0.5, there is a 50% chance that the event will occur—or not occur.
62
When you say that the probability of getting heads in a coin toss is 50%, you aren't saying what exactly?
Note that when you say the probability of getting heads is 50%, you aren’t claiming that any actual sequence of coin tosses will result in exactly 50% heads. For example, if you toss a fair coin ten times, you may get 4 heads and 6 tails, or 7 heads and 3 tails. However, if you continue to toss the coin, you can expect the long-run frequency of heads to get closer and closer to 50%.
63
Probability Notation
P: indicates the probability of an event A: represents an individual event B: represents an individual event ' : means an event doesn't occur so P(A'): the probability of event A not happening P(A): the probability of event A happening Examples: -The probability of event A is written as P(A). -The probability of event B is written as P(B). -For any event A, 0 ≤ P(A) ≤ 1. In other words, the probability of any event A is always between 0 and 1. -If P(A) > P(B), then event A has a higher chance of occurring than event B. -If P(A) = P(B), then event A and event B are equally likely to occur.
64
Complement of an event
In stats, the complement of an event is an event not occurring.
65
Complement rule (for mutually exclusive events)
In stats, the complement of an event is an event not occurring. Complement rule says that the probability that event A does not occur is P(A') = 1 - P(A) This rule applies to events that are mutually exclusive Note: Mutually Exclusive Events: Two events are mutually exclusive if they cannot occur at the same time.
66
Mutually Exclusive Events:
Two events are mutually exclusive if they cannot occur at the same time. Example: You can't be in China and Argentina at the same time. You can't roll a 2 and a 6 in the same single roll of the die.
67
Addition rule (for mutually exclusive events)
P(A or B) = P(A) + P(B) Example: What's the probability of rolling either a 2 or a 4 in a single roll? P(A or B) = P(A) + P(B) P(rolling a 2 or rolling a 4) = P(rolling a 2) + P(rolling a 4) = 1/6 + 1/6 = 2/6 = 1/3, so it's about 33%
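The addition rule above can be checked in a couple of lines of Python (a minimal sketch):

```python
from fractions import Fraction

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_two = Fraction(1, 6)   # P(rolling a 2)
p_four = Fraction(1, 6)  # P(rolling a 4)

p_two_or_four = p_two + p_four
print(p_two_or_four)  # 1/3
```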
68
Independent Events
Two events are independent if the occurrence of one event does not change the probability of the other event. Example: Checking out a book from the library does not affect tomorrow's weather.
69
Multiplication Rule (for independent events)
P(A and B) = P(A) * P(B) P(first toss tails and second toss heads) = P(first toss tails) * P(2nd toss heads) = .5 *.5 = .25
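The multiplication rule above, applied to the two coin tosses, looks like this in Python (an illustrative sketch):

```python
# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_tails = 0.5  # P(first toss tails)
p_heads = 0.5  # P(second toss heads)

p_tails_then_heads = p_tails * p_heads
print(p_tails_then_heads)  # 0.25
```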
70
Conditional Probability
The probability of an event occurring given that another event has already occurred
71
Dependent Events
Two events are dependent if the occurrence of one event changes the probability of the other event Examples: If you want to travel to another country, you need to have a passport If you want to access a website, you need internet access Conditional Probability Calculation: P(A and B) = P(A) * P(B|A) OR P(B|A) = P(A and B) / P(A) P(A and B) = probability of event A and event B P(A) = probability of event A P(B|A) = probability of event B given event A Note: B|A: the vertical bar means that event "B" depends on event "A" happening Example: What's the probability of drawing an ace from a deck of cards and then another ace from that same deck? P(A): chance of getting an ace on the first draw: 4/52 P(B|A): chance of getting an ace on the second draw: 3/51 P(A and B): ace on the first draw and second draw: P(A) * P(B|A) = 4/52 * 3/51 = 1/221, or about 0.5% Example 2: What's the probability that you'll get accepted by college z and receive a scholarship from college z? acceptance rate: 10/100 applicants: 10% scholarships awarded: 2/100 accepted students: 2% 10/100 * 2/100 = 1/500 = 0.2% Business Use Case: Use conditional probability to predict how an event like an ad campaign will impact sales revenue and then share findings with stakeholders so they can make more informed business decisions.
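The two-aces calculation above can be sketched in Python with exact fractions (names are illustrative):

```python
from fractions import Fraction

# Conditional probability for dependent events: P(A and B) = P(A) * P(B|A)
p_first_ace = Fraction(4, 52)               # P(A): ace on the first draw
p_second_ace_given_first = Fraction(3, 51)  # P(B|A): ace on the second draw

p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)         # 1/221
print(float(p_both_aces))  # ~0.0045, i.e., about 0.5%
```

Note how the second factor uses 51 cards, not 52: the first draw changed the deck, which is exactly what makes the events dependent.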
72
Conditional Probability
This is for dependent events. Note: Two events are dependent if the occurrence of one event changes the probability of the other event Conditional Probability Calculation: P(A and B) = P(A) * P(B|A) OR P(B|A) = P(A and B) / P(A) P(A and B) = probability of event A and event B P(A) = probability of event A P(B|A) = probability of event B given event A Note: B|A: the vertical bar means that event "B" depends on event "A" happening Example: What's the probability of drawing an ace from a deck of cards and then another ace from that same deck? P(A): chance of getting an ace on the first draw: 4/52 P(B|A): chance of getting an ace on the second draw: 3/51 P(A and B): ace on the first draw and second draw: P(A) * P(B|A) = 4/52 * 3/51 = 1/221, or about 0.5% Example 2: What's the probability that you'll get accepted by college z and receive a scholarship from college z? acceptance rate: 10/100 applicants: 10% scholarships awarded: 2/100 accepted students: 2% 10/100 * 2/100 = 1/500 = 0.2% Business Use Case: Use conditional probability to predict how an event like an ad campaign will impact sales revenue and then share findings with stakeholders so they can make more informed business decisions. Example 3: online purchases Let’s explore another example. Imagine you are a data professional working for an online retail store. You have data that tells you 20% of the customers who visit the store’s website make a purchase of $100 or more. If a customer spends $100, they are eligible to receive a free gift card. The store randomly awards gift cards to 10% of the customers who spend at least $100. You want to calculate the probability that a customer spends $100 and receives a gift card. Receiving a gift card depends on first spending $100. So, this is a conditional probability because it deals with two dependent events. Let's apply the conditional probability formula: P(A and B) = P(A) * P(B|A) You want to calculate the probability of both event A and event B occurring.
Let’s call event A $100 and event B gift card. The probability of event A is 0.2, or 20%. The probability of event B is 0.1, or 10%. P($100 and gift card) = P($100) * P(gift card given $100) = 0.2 * 0.1 = 0.02, or 2% So, the probability of a customer spending $100 or more and receiving a free gift card is 0.2 * 0.1 = 0.02, or 2%.
73
Bayes Theorem or Bayes Rule
P(A|B) = P(B|A) * P(A) / P(B) prior probability in Bayesian statistics: the probability of an event before new data is collected posterior probability in Bayesian statistics: the updated probability of an event based on new data Example: Let's say a medical condition is related to age, you can use Bayesian probability to determine if a person has the condition based on age. prior probability: the probability of a person having the condition posterior probability: the probability of a person having the condition if they're in a certain age group The calculation: P(A|B) = P(B|A) * P(A) / P(B) In English: for any 2 events A and B, the probability of A, given B = the probability of A multiplied by the probability of B given A / probability of B Math Terms: P(A): prior probability/probability a person has the condition P(A|B) posterior probability/probability a person has the condition if they're in a certain age group Note: Sometimes statisticians use the term "likelihood" to refer to the probability of event B given event A, and the term "evidence" to refer to the probability of event B. P(B|A): likelihood P(B): evidence SO P(A|B) = P(B|A) * P(A) / P(B) posterior = likelihood * prior / evidence posterior: prob a person has the condition if they're in a certain age group likelihood: among the people with the condition, the fraction that are in the age group (e.g., 65+) prior: the prob that a person has the condition evidence: fraction of all people are 65+ Example 2: spam filter A well-known application of Bayes’s theorem in the digital world is spam filtering, or predicting whether an email is spam or not. In practice, a sophisticated spam filter deals with many different variables, including the content of the email, its title, whether it has an attachment, the domain type of the sender address (.edu or .org), and more. However, we can use a simplified version of a Bayesian spam filter for our example. 
Let’s say you want to determine the probability that an email is spam given a specific word appears in the email. For this example, let’s use the word “money.” You discover the following information: -The probability of an email being spam is 20%. -The probability that the word “money” appears in an email is 15%. -The probability that the word “money” appears in a spam email is 40%. In this example, your prior probability is the probability of an email being spam. Your posterior probability, or what you ultimately want to find out, is the probability that an email is spam given that it contains the word “money.” The new data you will use to update your prior probability is the probability that the word “money” appears in an email and the probability that the word “money” appears in a spam email. When you work with Bayes’s theorem, it’s helpful to first figure out what event A is and what event B is—this makes it easier to understand the relationship between events and use the formula. Let’s call event A a spam email and event B the appearance of the word “money” in an email. Now, you can re-write Bayes’s theorem using the word “spam” for event A and the word “money” for event B. P(A|B) = P(B|A) * P(A) / P(B) P (Spam | Money) = P(Money | Spam) * P(Spam) / P(Money) You want to find out the following: P(Spam | Money), or posterior probability: the probability that an email is spam given that the word “money” appears in the email Now, enter your data into the formula: -P(Spam), or prior probability: the probability of an email being spam = 0.2, or 20% -P(Money), or evidence: the probability that the word “money” appears in an email = 0.15, or 15% -P(Money | Spam), or likelihood: the probability that the word “money” appears in an email given that the email is spam = 0.4, or 40% -P (Spam | Money) = P(Money | Spam) * P(Spam) / P(Money) = 0.4 * 0.2 / 0.15 = 0.53333, or about 53.3%. 
So, the probability that an email is spam given that the email contains the word “money” is 53.3%.
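The spam-filter arithmetic above can be sketched in Python (a minimal example with illustrative variable names):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Spam-filter example: A = spam email, B = the word "money" appears
p_spam = 0.2              # prior: P(Spam)
p_money = 0.15            # evidence: P(Money)
p_money_given_spam = 0.4  # likelihood: P(Money | Spam)

# posterior: P(Spam | Money)
p_spam_given_money = p_money_given_spam * p_spam / p_money
print(round(p_spam_given_money, 3))  # 0.533, i.e., about 53.3%
```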
74
Prior Probability in Bayesian Statistics
The probability of an event before new data is collected
75
Posterior Probability in Bayesian Statistics
The updated probability of an event based on new data
76
Bayesian Statistics/Bayesian Inference
A powerful method for analyzing and interpreting data in modern data analytics Bayes' theorem is used in a variety of fields including, but not limited to: AI, medical testing, and financial institutions, online retailers, marketers, etc. Examples: -Financial institutions use Bayesian stats to rate the risk of lending money to borrowers or to predict the success of an investment. -online retailers use Bayesian algorithms to predict whether or not users will like certain products and services -marketers rely on Bayes' theorem for identifying positive or negative responses for customer feedback.
77
Bayes Theorem, basic vs expanded version
Basic: P(A|B) = P(B|A) * P(A) / P(B)
Expanded: P(A|B) = P(B|A) * P(A) / [P(B|A) * P(A) + P(B|not A) * P(not A)]
Use the expanded version when you don't know the probability of event B. The expanded version is often used to evaluate: -medical diagnostic tests -quality control tests -software tests Example: Evaluate the accuracy of a diagnostic test -1% of the population has the medical condition -If a person has the condition, there's a 95% chance that the test is positive -If a person does not have the condition, there's still a 2% chance that the test is positive. prior probability = the probability that a person has the medical condition (1%) posterior probability = the probability that the condition is present GIVEN that the test is positive (this is what you want to find) Event A = actually having the medical condition Event B = testing positive *Note: these events are different, as you can test positive and not have the condition. P(A): the probability that a person has the condition = 1% P(B|A): probability of testing positive given that the person has the condition = 95% P(B|not A): probability of testing positive given that the person does NOT have the condition = 2% Use the complement rule to determine the probability of not having the condition. P(A') = 1 - P(A): the probability that event A does not occur is 1 minus the probability that it does occur. P(A') = 1 - 0.01 = 0.99 = 99% So, there is a 99% probability that a person does not have the condition. Use the expanded version of Bayes' theorem, because you don't know the probability of event B (the probability that a person gets a positive test result). P(A|B) = (0.95 * 0.01) / (0.95 * 0.01 + 0.02 * 0.99) = 0.324 = 32.4% So, P(A|B), the probability that the condition is present given that the test is positive, is 32.4%. This is low because the condition is rare to begin with: most people don't have it, so a positive result is more often a false positive than a true one.
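The diagnostic-test calculation above can be sketched in Python (illustrative names; the numbers come from the card):

```python
# Expanded Bayes' theorem, used when P(B) is not known directly:
# P(A|B) = P(B|A)*P(A) / (P(B|A)*P(A) + P(B|not A)*P(not A))
p_condition = 0.01               # P(A): prior, 1% of the population
p_pos_given_condition = 0.95     # P(B|A): likelihood
p_pos_given_no_condition = 0.02  # P(B|not A): false-positive rate
p_no_condition = 1 - p_condition # complement rule: P(A') = 1 - P(A)

p_condition_given_pos = (p_pos_given_condition * p_condition) / (
    p_pos_given_condition * p_condition
    + p_pos_given_no_condition * p_no_condition
)
print(round(p_condition_given_pos, 3))  # 0.324, i.e., about 32.4%
```

The denominator is just P(B) rebuilt from its two pieces: positives among people with the condition plus positives among people without it.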
78
False Positive
Test result that indicates something is present when it's really not
79
False Negative
Test result that indicates something is not present when it really is
80
Probability Distribution
Describes the likelihood of the possible outcomes of a random event
81
Random variable
Represents the values for the possible outcomes of a random event 2 Types of Random variables: -Discrete random variables: has a countable number of possible values (e.g., whole numbers that can be counted, so the number of people in a room, etc.) -Continuous random variables: takes all the possible values in some range of numbers (e.g., decimal values like height, weight, etc.) General Rule of Thumb: -If you can count the number of outcomes, it's discrete (e.g., the number of times a coin flip results in heads) -If you measure the outcomes, it's continuous (e.g., a person's time in a marathon) Note: -Discrete distributions represent discrete random variables -Continuous distributions represent continuous random variables
82
Sample Space
The set of all possible values for a random variable
83
Discrete probability distributions
represent discrete random variables, or discrete events
84
Binomial Distribution
A discrete distribution that models the probability of events with only two possible outcomes: success or failure This definition assumes that -each event is independent (i.e., an event does not affect the probability of the other event/s) -the probability of success is the same for each event Note: You can label any outcome as a "success". For example, each coin toss has only two possible options: heads or tails. Either heads or tails could be labeled as a success based on the needs of your analysis. *But whatever label you apply to the outcomes, they MUST be mutually exclusive. Business use cases: Binomial distributions are used to model the probability that -a new medication generates side effects -a credit card transaction is fraudulent -a stock price rises or falls in value In machine learning, a binomial distribution is often used to classify data. Examples: train an algorithm to recognize whether an image is or is not a cat. Binomial Distribution represents a random event called a binomial experiment.
85
Mutually Exclusive
Two outcomes are mutually exclusive if they cannot occur at the same time
86
Binomial Experiment
A type of random experiment -The experiment consists of a number of repeated trials -Each trial has only two possible outcomes -The probability of success is the same for each trial -Each trial is independent Example 1 of a Binomial Experiment: Tossing a Coin 10 times in a row. -10 repeated coin tosses -two possible outcomes: heads or tails -the probability of success for each toss is the same: 50% -the outcome of any one coin toss does not affect the outcome of any other coin toss Example 2 of a Binomial experiment: You want to know how many customers return an item to a department store on a given day. 100 customers visit the store each day. 10% of all customers who visit the store make a return. You label a return as a success. This is a binomial experiment because -100 customer visits -2 possible outcomes: return or no return -the probability of success for each customer visit is the same: 10% -the outcome of one customer visit does not affect the outcome of any other customer visit
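The probability of a specific number of successes in a binomial experiment like the ones above follows the binomial formula, which can be sketched in Python with only the standard library (the function name is mine):

```python
from math import comb

# Binomial probability mass function:
# P(k successes in n trials) = C(n, k) * p**k * (1 - p)**(n - k)
def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 8 heads in 10 fair coin tosses
print(binomial_pmf(8, 10, 0.5))  # 45/1024, ~0.0439
```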
87
Random Experiment
a process whose outcome cannot be predicted with certainty. All random experiments have 3 things in common: -The experiment can have more than one possible outcome -you can represent each possible outcome in advance -the outcome of the experiment depends on chance
88
Random Experiment vs Binomial Experiment
Random Experiment: a process whose outcome cannot be predicted with certainty. All random experiments have 3 things in common: -The experiment can have more than one possible outcome -you can represent each possible outcome in advance -the outcome of the experiment depends on chance Binomial Experiment: A type of random experiment -The experiment consists of a number of repeated trials -Each trial has only two possible outcomes -The probability of success is the same for each trial -each trial is independent Example 1 of a Binomial Experiment: Tossing a Coin 10 times in a row. -10 repeated coin tosses -two possible outcomes: heads or tails -the probability of success for each toss is the same: 50% -the outcome of any one coin toss does not affect the outcome of any other coin toss Example 2 of a Binomial experiment: You want to know how many customers return an item to a department store on a given day. 100 customers visit the store each day. 10% of all customers who visit the store make a return. You label a return as a success. This is a binomial experiment because -100 customer visits -2 possible outcomes: return or no return -the probability of success for each customer visit is the same: 10% -the outcome of one customer visit does not affect the outcome of any other customer visit
89
different types of distributions help you __
model different types of data
90
Poisson Distribution
Models the probability that a certain number of events will occur during a specific time period Use Cases: You can use the Poisson distribution to model data such as: -calls per hour for a customer service call center -visitors per hour for a website -customers per day at a restaurant -severe storms per month in a city
91
Poisson Experiment
A type of random experiment. Always have the following attributes: -The number of events in the experiment can be counted -The mean number of events that occur during a specific time period is known -Each event is independent Example: The drive-through at a restaurant receives an average of 2 orders per minute. You want to determine the probability that the restaurant will receive a certain number of orders in a given minute. You can tell this is a Poisson experiment because it meets the above stated criteria. That is, -you can count the number of orders -there is an average of 2 orders per minute -the probability of one person placing an order does not affect the probability of another person placing an order
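The drive-through example above can be worked out with the Poisson formula, sketched here in Python (the function name is mine):

```python
from math import exp, factorial

# Poisson probability mass function:
# P(k events) = lam**k * e**(-lam) / k!
def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# The drive-through averages 2 orders per minute (lam = 2).
# Probability of exactly 3 orders in a given minute:
print(round(poisson_pmf(3, 2), 4))  # ~0.1804
```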
92
How do you know when something is a Poisson experiment (thus needing the Poisson distribution to model data) or a binomial experiment (thus needing the binomial distribution to model data)?
Poisson:
-Given: the avg probability of an event happening for a specific time period
-Want to find: the probability of a certain # of events happening in that time period
-Example: the prob of getting 12 calls between 2-3 pm

Binomial:
-Given: an exact probability of an event happening
-Want to find: the prob of the event happening a certain # of times in a repeated trial
-Example: the prob of getting 8 heads in 10 coin tosses (exact prob = 50%)
93
What are discrete distributions
The binomial and Poisson distributions (the Bernoulli distribution is also discrete)
94
probability distribution
probability distribution describes the likelihood of the possible outcomes of a random event.
94
uniform distribution
a uniform distribution describes events whose outcomes are all equally likely or have equal probability Example: Rolling a die can result in 6 outcomes: 1, 2, 3, 4, 5, or 6. The probability of each outcome is the same: 1/6 or about 16.7%
94
Bernoulli Distribution
Like the binomial distribution, it models events that have only two possible outcomes: success or failure. The only difference: the Bernoulli distribution refers to only a single trial of an experiment, while the binomial refers to repeated trials. Classic example of a Bernoulli: one coin toss.
95
normal distribution
Often called a bell curve, because of its shape. Often known as a Gaussian distribution. A continuous probability distribution that is symmetrical on both sides of the mean and bell-shaped. The most common probability distribution in statistics, because so many data sets display a bell curve. Example 1: If you randomly sample 100 people, you'll discover a normal distribution of continuous variables like -height -weight -blood pressure -IQ scores -salaries Example 2: Standardized tests. The majority of people will score close to the average/mean score. Fewer numbers of people will score above or below average. Normal distributions have the following features: -the shape is a bell curve -the mean is located at the center of the curve -the curve is symmetrical on both sides of the center -the total area under the curve equals 1
96
standard deviation
calculates the typical distance of a data point from the mean of a dataset
97
the empirical rule
-68% of values fall within 1 standard deviation of the mean -95% of values fall within 2 standard deviations of the mean -99.7% of values fall within 3 standard deviations of the mean The empirical rule can give you an idea of how values in your dataset are distributed (e.g., what percentage of values will fall within 1, 2, or 3 standard deviations of the mean). It's also helpful for detecting outliers. (Typically, values that lie more than 3 standard deviations above or below the mean are considered outliers.) You need to detect outliers because some values may be due to errors in data collection or data processing. These false values may skew the results of your analysis. Note: standard deviation: calculates the typical distance of a data point from the mean of a dataset
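The 68-95-99.7 figures above can be checked against the standard normal distribution using Python's standard library (a minimal sketch):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist()

# Probability of falling within 1, 2, and 3 standard deviations of the mean
within_1 = z.cdf(1) - z.cdf(-1)
within_2 = z.cdf(2) - z.cdf(-2)
within_3 = z.cdf(3) - z.cdf(-3)

print(round(within_1, 4))  # ~0.6827 (about 68%)
print(round(within_2, 4))  # ~0.9545 (about 95%)
print(round(within_3, 4))  # ~0.9973 (about 99.7%)
```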
98
continuous probability distributions
Continuous probability distributions represent continuous random variables, which can take on all the possible values in a range of numbers. Typically, these are decimal values that can be measured, such as height, weight, time, or temperature. (Note: the binomial, Bernoulli, and Poisson distributions are discrete, not continuous.) For example, you can keep on measuring time with more accuracy: 1.1 seconds, 1.12 seconds, 1.1257 seconds, and so on. Because there are infinite values that X could assume, the probability of X taking on any one specific value is zero. Therefore we often speak in ranges of values, e.g., P(X > 0) = 0.50. The normal distribution is one example of a continuous distribution. P(X > 0) = 0.50 means the probability that X is greater than 0 equals 50%. *Why 50%? Because the standard normal distribution is centered at zero (the bell curve), so half the distribution is greater than 0 and the other half is less than 0. And because the normal distribution is a continuous distribution, we cannot calculate an exact probability for a single outcome; instead we calculate a probability for a range of outcomes (for example, the probability that a random variable X is greater than 10). Example: Let's say we want to calculate the probability that z is between -1 and 1. To do so, first look up the probability that z is less than negative one: P(z < -1) = 0.1587. (This means the probability that z is less than -1 is 0.1587.) Because the normal distribution is symmetric, we therefore know that the probability that z is greater than one also equals 0.1587: P(z > 1) = 0.1587. To calculate the probability that z falls between -1 and 1, take 1 - 2(0.1587) = 0.6826. Because 0.1587 * 2 represents the probability that z falls below -1 or above 1, 1 - (0.1587 * 2) is the probability that z is between -1 and 1. Visual: left tail: 0.1587; middle: unknown and what we want; right tail: 0.1587. The total probability must equal 1.
We know that the left and right tails together = 0.1587 * 2, ergo 1 - (0.1587 * 2) = the middle.
99
probability density vs probability
Probability: the actual chance of an event. Probability answers questions like: What is the probability a die lands on 4? → 1/6 What is the probability a person is between 5'6" and 5'8"? → maybe 0.18 Probability is always between 0 and 1 (0 ≤ P ≤ 1). It represents the chance that something happens. Probability density: the concentration of probability. Probability density tells you how tightly probability is packed around a specific value. It is NOT itself a probability. Instead, it helps you calculate probability over a range. Analogy: population density vs population. This is the best real-world analogy. Population density: 100 people per square mile. Population: depends on how much area you look at. Example: Density = 100 people/sq mile, Area = 5 sq miles, Population = 100 × 5 = 500. Same idea: probability density ↔ population density; probability ↔ total population; range width ↔ area size.
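The density-times-width idea above can be demonstrated numerically with the standard normal distribution (a sketch; over a narrow range, density × width approximates the probability, just as population density × area approximates population):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Density at a point is NOT a probability
density_at_0 = z.pdf(0)  # ~0.3989

# Approximate P(-0.01 < z < 0.01) as density * range width
approx = density_at_0 * 0.02
exact = z.cdf(0.01) - z.cdf(-0.01)
print(round(approx, 5), round(exact, 5))  # nearly identical
```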
100
z-score
A measure of how many standard deviations below or above the population mean a data point is. Also called standard scores, as they're based on the standard normal distribution, which has a mean of zero and a standard deviation of 1. z-scores typically range from -3 to 3. Examples: -The z-score is 0 if the value is equal to the mean -The z-score is positive if the value is greater than the mean -The z-score is negative if the value is less than the mean -A z-score of 1.5 is 1.5 standard deviations above the mean -A z-score of -2 is 2 standard deviations below the mean z = (x - mu) / sigma z: z-score x: raw score or single data value mu (pronounced "mew"): the population mean sigma: the population's standard deviation Example: You take a standardized test. You score 133. The test has a mean score of 100 and a standard deviation of 15. z = (133 - 100) / 15 z = 2.2 So, your test score is 2.2 standard deviations above the mean. z-scores are useful because they give us an idea of how an individual value compares to the rest of the distribution.
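The test-score example above can be sketched in Python (the function name is mine):

```python
# z-score: how many standard deviations a value is from the mean
# z = (x - mu) / sigma
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# Test score of 133, with mean 100 and standard deviation 15
print(z_score(133, 100, 15))  # 2.2
```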
101
standardization
The process of putting different variables on the same scale
102
Sampling
The process of selecting a subset of data from a population
103
In stats, population can refer to:
any type of data, including: people, organizations, objects, events, measurements, and more For instance, a population might be the set of: -All students at a university -All the cell phones ever manufactured by a company -All the forests on Earth
104
Representative Sample
Accurately reflects the characteristics of a population
105
example questions answered by sampling:
-How many products in an app store do we need to test to feel confident that all the products are secure from malware? -How do we select a sample of users to run an effective A/B test for an online retail store? -How do we select a sample of customers of a video streaming service to get reliable feedback on the shows they watch?
106
Sampling is useful, because
-a sample requires less time than a full population -a sample saves money and resources -a sample is more practical than analyzing an entire population
107
If your predictive model is based on a bad sample, then your predictions__
will not be accurate. Ultimately, the quality of your sample helps determine the quality of the insights you share with stakeholders. To make reliable inferences about a population, make sure your sample is representative of the population.
108
The sampling process
Step 1: Identify the target population (e.g., individuals 18 years and older who are eligible to vote in x city) Step 2: Select the sampling frame (a list of individuals who meet the criteria of the target population, starting with A. Adams and ending with Z. Zenya) Step 3: Choose the sampling method Step 4: Determine the sample size Step 5: Collect the sample data Example 2: you’re a data professional working for a company that manufactures home appliances. The company wants to find out how customers feel about the innovative digital features on their newest refrigerator model. The refrigerator has been on the market for two years and 10,000 people have purchased it. Your manager asks you to conduct a customer satisfaction survey and share the results with stakeholders. Step 1: Identify the target population: the 10,000 customers who purchased the company’s newest refrigerator model. Step 2: Create a sampling frame: an alphabetical list of the names of all these customers ** Note: Ideally, your sampling frame should include the entire target population. However, for practical reasons, your sampling frame may not exactly match your target population, because you may not have access to every member of the population For instance, the company’s customer database may be incomplete, or contain data processing errors. Or, some customers may have changed their contact information since their purchase, and you may be unable to locate or contact them. Furthermore, sometimes the sampling frame might include elements outside of the target population simply by accident or because it is impossible to know the target population with certainty. ** Step 3: Choose the sampling method (i.e., probability sampling or non-probability sampling). Step 4: Determine the sample size, since you don’t have the resources to survey everyone in your sampling frame. ** Note: In general, the larger the sample size, the more precise your predictions. 
However, using larger samples typically requires more resources. The sample size you choose depends on various factors, including the sampling method, the size and complexity of the target population, the limits of your resources, your timeline, and the goal of your research. Based on these factors, you can decide how many customers to include in your sample. ** Step 5: Collect the data: You give a customer satisfaction survey to the customers selected for your sample. The survey responses provide useful data on how customers feel about the digital features of the refrigerator. Then, you share your results with stakeholders to help them make more informed decisions about whether to continue to invest in these features for future versions of this refrigerator, and develop similar features for other models.
109
Target Population
The complete set of elements that you're interested in knowing more about
110
Sampling Frame
A list of all the items in your target population
111
Sampling Methods (basic, not super specific, definition)
Probability Sampling: Uses random selection to generate a sample. (Each member of the population has an equal chance of being selected, which gives you the best chance of a representative sample.) Non-Probability Sampling: Based on convenience or personal preference.
112
Sample Size
The number of individuals or items chosen for a study or experiment
113
Probability Sampling Methods
-Simple random sampling -Stratified random sampling -Cluster random sampling -Systematic random sampling (These are all based on random selection, which is the preferred method for accurately representing a population and reducing bias.)
114
Simple Random Sample
A probability sampling method. Every member of a population is selected randomly and has an equal chance of being chosen. You randomly select members using a random number generator or by another method of random selection. Example: There are 1,000 people at your company. You assign a number to each person in your database. Then you use a random number generator to select 100 people. Pros: They tend to be fairly representative and tend to avoid bias since every member of the population has an equal chance of being selected. Cons: It's expensive and time-consuming to conduct large simple random sampling
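The 1,000-person example above can be sketched in Python (a minimal illustration; the population and seed are hypothetical):

```python
import random

random.seed(42)  # hypothetical seed, for reproducibility

# Assign a number to each of the 1,000 people in the company
population = list(range(1, 1001))

# Randomly select 100 people; every member has an equal chance
sample = random.sample(population, k=100)

print(len(sample))       # 100 people selected
print(len(set(sample)))  # 100 -- no one is picked twice (without replacement)
```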
115
Stratified Random Sample
A probability sampling method. Divide a population into groups and randomly select some members from each group to be in the sample. These groups are called strata. Strata can be organized by age, gender, income, or whatever category you wish to study. Example: You want to conduct a study on how much time high school students spend studying on weekends. You could divide the student pop by age (14, 15, 16, and 17). Then survey an equal number of students from each age group. Pros: Helps ensure that members from each group (i.e., strata) are included. Cons: It can be hard to develop strata if you lack knowledge about the pop to be studied. Example: If you don't know how relevant job title, industry, etc. are to median income (and you're doing a study on just that), it will be difficult to choose the best category (read: strata).
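The high-school example can be sketched with pandas (the dataset is hypothetical; `groupby(...).sample` draws randomly within each stratum):

```python
import pandas as pd

# Hypothetical student population: 100 students per age group
students = pd.DataFrame({
    "student_id": range(400),
    "age": [14, 15, 16, 17] * 100,
})

# Stratify by age, then randomly sample an equal number from each stratum
strata_sample = students.groupby("age", group_keys=False).sample(n=10, random_state=0)

print(strata_sample.groupby("age").size())  # 10 students per age group
```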
116
Cluster Random Sample
A probability sampling method. Divide a population into clusters, randomly select certain clusters, and include all members from the chosen clusters in the sample. Clusters are created based on identifying details (e.g., age, gender, location, etc.). Example: You want to conduct a survey of employees at a global company using this method. The company has ten offices in ten cities around the world. Each office has about the same number of employees with similar roles. You randomly select 3 offices in 3 cities as clusters. You include all the employees at the 3 offices. Pros: -Gets every member from a particular cluster, which is useful when each cluster reflects the population as a whole. -Helpful when dealing with large and diverse populations that have clearly defined subgroups. Example: If researchers want to learn about the preferences of primary school students in Oslo, Norway, they can use one school as representative of all schools in the city. Con: It may be difficult to create clusters that accurately reflect the overall population. Example: You may only have access to offices in the United States vs the whole world, and employees in the US may have different characteristics and values.
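The global-office example can be sketched as follows (office names, sizes, and the seed are hypothetical):

```python
import random

random.seed(1)  # hypothetical seed

# Ten offices of roughly equal size, each a potential cluster
offices = ["Oslo", "Tokyo", "Lagos", "Lima", "Austin",
           "Paris", "Mumbai", "Seoul", "Cairo", "Sydney"]
employees = {city: [f"{city}-{i}" for i in range(50)] for city in offices}

# Randomly choose 3 clusters, then include EVERY employee in those clusters
chosen = random.sample(offices, k=3)
cluster_sample = [person for city in chosen for person in employees[city]]

print(len(cluster_sample))  # 150 = 3 offices x 50 employees each
```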
117
Systematic Random Sample
A probability sampling method. Put every member of a population into an ordered sequence. Then, you choose a random starting point in the sequence and select members for your sample at regular intervals. Pro: -Often representative of a pop since each member has an equal chance of being represented -Quick and convenient when you have a complete list of the population Con: -You need to know the size of the pop that you want to study before you begin. (If you don't have this info, it's difficult to choose consistent intervals).
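The interval logic can be sketched in a few lines (the population size, sample size, and seed are hypothetical):

```python
import random

random.seed(7)  # hypothetical seed

population = list(range(1, 1001))  # the ordered sequence of 1,000 members
n = 100                            # desired sample size
interval = len(population) // n    # take every 10th member

start = random.randrange(interval)  # random starting point within the first interval
sample = population[start::interval]

print(len(sample))  # 100 members, evenly spaced through the sequence
```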
118
Sampling Bias
When a sample is not representative of the population as a whole
119
Types of sampling methods
Probability sampling methods use random selection, which helps avoid sampling bias. Non-probability sampling methods do not use random selection. These often result in biased samples. (The sample is often not representative of the population as a whole.)
120
Why is non-probability sampling used?
It's often less expensive and more convenient for researchers to conduct. Issues: It often results in biased samples. (The sample is often not representative of the population as a whole.)
121
Non-probability methods
-convenience sampling -voluntary response sampling -snowball sampling -purposive sampling
122
Convenience Sampling
Non-probability sampling. Choose members of a population that are easy to contact or reach (e.g., your workplace, school, or a public park). Example: To conduct an opinion poll, a researcher may stand in front of a local high school during the day to poll the people who happen to walk by. Cons: Convenience sampling often shows undercoverage bias. (In the above example, people who don't work at or attend the school will not be represented in the sample.) ** Undercoverage bias: When some members of a population are inadequately represented in a sample.
123
Undercoverage Bias
When some members of a population are inadequately represented in a sample.
124
Voluntary Response Sampling
Non-probability sampling. Consists of members of a population who volunteer to participate in a study. Example: Restaurant owners want to know how their customers feel about their dinner options. They ask their regular customers to take an online survey about the quality of the restaurant's food. Cons: Voluntary response sampling tends to suffer from nonresponse bias. Note: People who voluntarily respond are likely to have stronger opinions (either positive or negative) than the rest of the population. This makes the volunteer respondents at the restaurant in the above example an unrepresentative sample. ** Nonresponse Bias: When certain groups of people are less likely to provide responses.
125
Snowball Sampling
Non-probability sampling. Researchers recruit initial participants to be in a study and then ask them to recruit other people to participate in the study. (Like a snowball, the sample gets bigger and bigger as more participants join in.) This type of recruiting can result in sampling bias. Since initial participants will recruit additional participants on their own, it's likely that they'll share similar characteristics. And these characteristics may not be representative of the total pop being studied.
126
Purposive Sampling
Non-probability sampling. Researchers select participants based on the purpose of their study. Applicants who do not fit the profile are rejected. This can lead to biased outcomes, because the individuals in the sample are not representative of the population as a whole. Example: Researcher wants to survey students on the efficacy of certain teaching methods at their university. The researcher only includes the students who regularly attend class and have an established record of academic achievement. They select the students with the highest grade point averages. Issue: biased outcome. (see above note.)
127
Nonresponse Bias
When certain groups of people are less likely to provide responses.
128
When is non-probability sampling useful?
Non-probability sampling is useful for collecting data in situations where you have limited time, budget, and other resources. Non-probability sampling is also useful for exploratory research, when you want to get an initial understanding of a population, rather than make inferences about the population as a whole. However, it’s important to remember that non-probability sampling methods have a high risk of sampling bias.
129
Statistic vs parameter
Statistic: A characteristic of a sample Parameter: A characteristic of a population Example: -The mean weight of a random sample of 100 penguins is a statistic -The mean weight of the total population of 10,000 penguins is a parameter
130
Point Estimate
Uses a single value to estimate a population parameter Example: A data professional may use the mean weight of 100 penguins to estimate the mean weight of the population (of penguins). This is an example of using a single value to estimate a population parameter. Note: Parameter: A characteristic of a population -The mean weight of the total population of 10,000 penguins is a parameter (-The mean weight of a random sample of 100 penguins is a statistic)
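The penguin example as a tiny simulation (the population is hypothetical, generated just for illustration):

```python
import random
import statistics

random.seed(5)  # hypothetical seed

# Hypothetical population: 10,000 penguin weights (lbs)
population = [random.gauss(3.1, 0.4) for _ in range(10_000)]

# Point estimate: a single value (the mean of a random sample of 100)
# stands in for the unknown population mean
sample = random.sample(population, 100)
point_estimate = statistics.mean(sample)

print(round(point_estimate, 2))  # close to the population mean of ~3.1
```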
131
Sampling distribution
A probability distribution of a sample statistic Let's say you take repeated samples of the same size from a population. Since each sample is random, the mean value will vary from sample to sample in a way that can't be predicted. Notes: Probability Distribution: possible outcomes of a random variable like a coin toss or a die roll. Statistic: A characteristic of a sample (e.g., mean weight of a random sample of 100 penguins). Example of sampling distribution: Pop of 10,000 penguins. You take repeated samples of the same size from this population, looking at weight. Since each sample is random, the mean value will vary from sample to sample in a way that can't be predicted. First sample of 10 penguins, mean weight: 3.1 lbs Second sample of 10 penguins, mean weight: 2.8 lbs Third sample of 10 penguins, mean weight: 2.9 lbs Each time you take a sample, you'll get closer to the population mean. (You can get outliers, e.g., a sample of larger than average penguins or smaller than average penguins)
132
Sampling variability
how much an estimate varies between samples (You can use a sampling distribution to represent the frequency of all your different sample means.)
133
As you increase the size of a sample, the mean weight of your sample data will get closer to___. If you sampled the entire population, your sample mean would be the same as ____. But to get an accurate estimate of the population mean, you don't need to sample the entire population. If you take a large enough sample from a population (e.g., ____, your sample mean will be___.
As you increase the size of a sample, the mean weight of your sample data will get closer to the mean weight of the population. If you sampled the entire population, your sample mean would be the same as your population mean. But to get an accurate estimate of the population mean, you don't need to sample the entire population. If you take a large enough sample from a population (e.g., 100 from 10,000), your sample mean will be an accurate estimate of the population mean.
134
If your sample is large enough, your sample mean will roughly equate to the ___
Population mean Example: Sample of 100 penguins, mean weight: 3 lbs. So, your best estimate for the entire population is 3 lbs. The population mean, in this example, is 3.1 lbs.
135
The more variability in your sample data, the less likely the sample mean is_____. Data professionals use the standard deviation of the sample means to measure this variability.
The more variability in your sample data, the less likely the sample mean is an accurate estimate of the population mean. Example: Population mean for blue penguins: 3.1 lbs -sample mean 1: 3.3 lbs -sample mean 2: 2.8 lbs -sample mean 3: 2.4 lbs Note: Standard deviation measures the variability of your data (i.e., how spread out your values are). The more spread between the data values, the larger the standard deviation.
136
When reviewing sample means, a larger standard error means ______ and a smaller standard error means ______
Larger standard error = sample means are more spread out Smaller standard error = sample means are closer together
137
Keep in mind that the concept of standard error assumes ____ In reality, researchers usually work with a single sample. It's often too complicated, expensive, or time-consuming to take repeated samples of a population. Instead, statisticians have _____
Keep in mind that the concept of standard error assumes repeated sampling. In reality, researchers usually work with a single sample. It's often too complicated, expensive, or time-consuming to take repeated samples of a population. Instead, statisticians have derived a formula for calculating standard error based on the mathematical assumption of repeated sampling. Standard error of the mean: s/√n s: sample standard deviation n: sample size Example: Sample of 100 blue penguins has a mean weight of 3 lbs and a standard deviation of 1 lb. =s/√n =1/√100 =0.1 lbs So, 0.1 is the standard error of the mean
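The penguin calculation in the card can be written out directly:

```python
import math

s = 1.0   # sample standard deviation (1 lb)
n = 100   # sample size (100 penguins)

standard_error = s / math.sqrt(n)  # s / sqrt(n)
print(standard_error)  # 0.1 lbs
```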
138
As the sample size gets larger, the standard error gets ___
smaller
139
The central limit theorem can be used to estimate (give a few examples):
-the mean annual household income for an entire city or country -the mean height and weight for an entire animal or plant population -the mean commute time for all employees of a large corporation
140
Central Limit Theorem
As the sample size increases, your sampling distribution assumes the shape of a bell curve. AND If you take a large enough sample, the sample mean will be roughly equal to the population mean. This holds true for ANY population. You don't need to know the shape of your population's distribution (i.e., right-skewed, left-skewed, etc.) in advance to apply the theorem. If you collect a large enough sample, the shape of your sampling distribution will follow a normal distribution. Note: There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply, but in general, a sample size of thirty or more is considered sufficient. Exploratory Data Analysis (EDA) can help you determine how large of a sample size is necessary for a given dataset.
141
In order to apply the central limit theorem, the following conditions must be met:
-Randomization: Your sample data must be the result of random selection. Random selection means that every member in the population has an equal chance of being chosen for the sample. -Independence: Your sample values must be independent of each other. Independence means that the value of one observation does not affect the value of another observation. Typically, if you know that the individuals or items in your dataset were selected randomly, you can also assume independence. -10%: To help ensure that the condition of independence is met, your sample size should be no larger than 10% of the total population when the sample is drawn without replacement (which is usually the case). (Note: In general, you can sample with or without replacement. When a population element can be selected only one time, you are sampling without replacement. When a population element can be selected more than one time, you are sampling with replacement.) There is no exact rule for how large a sample size needs to be in order for the central limit theorem to apply. The answer depends on the following factors: -Requirements for precision. The larger the sample size, the more closely your sampling distribution will resemble a normal distribution, and the more precise your estimate of the population mean will be. -The shape of the population. If your population distribution is roughly bell-shaped and already resembles a normal distribution, the sampling distribution of the sample mean will be close to a normal distribution even with a small sample size. In general, many statisticians and data professionals consider a sample size of 30 to be sufficient when the population distribution is roughly bell-shaped, or approximately normal. However, if the original population is not normal—for example, if it’s extremely skewed or has lots of outliers—data professionals often prefer the sample size to be a bit larger. 
Exploratory data analysis can help you determine how large of a sample is necessary for a given dataset.
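The theorem can be seen in a quick simulation (a hypothetical, right-skewed population; standard library only):

```python
import random
import statistics

random.seed(0)  # hypothetical seed

# A clearly non-normal (right-skewed) population of 100,000 values
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# Take 2,000 random samples of size 30 and record each sample mean
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# Despite the skewed population, the sample means center on the population mean
print(round(statistics.mean(sample_means) - pop_mean, 2))
```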
142
Population Proportion
The percentage of individuals or elements in a population that share a certain characteristic
143
As sample size gets larger, standard error gets __
smaller
144
Sampling Distribution
A probability distribution of a sample statistic. Notes: Probability Distribution: represents the possible outcomes of a random variable, such as a coin toss or a die roll Sample statistics are based on randomly sampled data, and their outcome cannot be predicted with certainty.
145
Estimating population parameters through sampling is a powerful form of statistical inference.
Sampling distributions describe the uncertainty associated with a sample statistic, and help you make proper statistical inferences. This is important because stakeholder decisions are often based on the estimates you provide.
146
sampling with replacement
When a population element can be selected more than one time. Steps for Sampling with Replacement 1. Select an item randomly from the population. 2. Record the selected item. 3. Return the item to the population. 4. Repeat the process until the desired sample size is achieved. Use cases: Bootstrapping, Monte Carlo methods Notes: Bootstrapping: A resampling method in which samples are drawn with replacement from observed data to estimate the distribution of a statistic. Monte Carlo Simulation: Used in simulations where different scenarios require random samples drawn with replacement to model possible outcomes.
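The bootstrapping use case can be sketched as follows (the observed data and seed are hypothetical):

```python
import random
import statistics

random.seed(3)  # hypothetical seed

# One observed sample, e.g., 20 penguin weights in lbs
observed = [random.gauss(3.1, 0.5) for _ in range(20)]

# Bootstrap: resample WITH replacement (random.choices) many times
# to estimate how much the sample mean varies
boot_means = [statistics.mean(random.choices(observed, k=len(observed)))
              for _ in range(1_000)]

# The spread of the bootstrap means estimates the standard error
print(round(statistics.stdev(boot_means), 2))
```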
147
sampling without replacement
When a population element can be selected only one time Steps for Sampling without Replacement 1. Select an item randomly from the population. 2. Record the selected item. 3. Remove the selected item from the population. 4. Repeat the process until the desired sample size is achieved Use case: Lottery draws, survey sampling Notes: Lottery Draws: Drawing lottery numbers without replacement ensures that no number can appear more than once. Survey Sampling: Selecting participants for a survey where no individual can be chosen more than once.
148
Random Seed
A starting point for generating random numbers
149
pandas .sample()
Draws a random sample of rows from a DataFrame. Example (sample 50 rows with replacement, with a random seed for reproducibility): sampled_data = epa_data.sample(n=50, replace=True, random_state=42) sampled_data
150
Standard Error
Standard Error: The standard deviation of the sample means Example: We weighed 5 mice. We find the mean and the standard deviation (how spread out the data is). We do this experiment 6 times, so we have 6 means and 6 standard deviations. You can then calculate the mean of the 6 means and the standard deviation of the 6 means. The standard deviation of the means is the standard error.
151
Confidence Interval vs Confidence Level
Confidence Interval: A range of values that describes the uncertainty surrounding an estimate. Technically, 95% confidence means that if you take repeated random samples from a population, and construct a confidence interval for each sample using the same method, you can expect that 95% of these intervals will capture the population mean. You can also expect that 5% of the total will not capture the population mean. Use cases: A data professional may use a confidence interval to describe the uncertainty of an estimate for: -the average return on an investment for a stock portfolio -the average maintenance costs for factory machinery -the percentage of customers who will register for a rewards program -the percentage of website visitors who will click on an ad Confidence Level: The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling.
152
Frequentist vs Bayesian
Frequentist: Treats a population parameter as a fixed, unknown value; probability statements describe the long-run behavior of repeated sampling (e.g., confidence interval). Bayesian: Treats the parameter itself as a random variable with a probability distribution that is updated as data is observed (e.g., credible interval).
153
Point Estimate
Uses a single value to estimate a population parameter
154
Interval Estimate
Uses a range of values to estimate a population parameter
155
Population Parameter vs Statistic
Population Parameter: A number describing the whole population (e.g., the population mean) vs Statistic: A number describing the sample (e.g., the sample mean) Ex// Pop Parameter: Proportion of all US residents that support the death penalty Sample Statistic: Proportion of 2000 randomly sampled participants that support the death penalty. Pop parameter: Median income of all college students in Massachusetts Sample Statistic: Median income of 850 college students in Boston and Wellesley. Pop parameter: Standard deviation of weights of all avocados in the region. Sample Statistic: Standard deviation of weights of avocados from one farm
156
What does a confidence interval include?
-Sample statistic -Margin of error -Confidence level Examples: Sample mean of our sample of penguins is 30 lbs. Notes: -Population Parameter: A number describing the whole population (e.g., the population mean) -Statistic: A number describing the sample (e.g., the sample mean) -Margin of error: The max expected difference between a pop parameter and a sample estimate. -Pop parameter: Standard deviation of weights of all avocados in the region. -Sample Statistic: Standard deviation of weights of avocados from one farm Confidence Interval: sample statistic +/- margin of error -Confidence Level: The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling.
157
What does a 95% confidence interval mean? What does it not mean?
It means: -95% of the intervals capture the population mean -5% of the intervals do not capture the population mean. Technically, 95% confidence means that if you take repeated random samples from a population, and construct a confidence interval for each sample using the same method, you can expect that 95% of these intervals will capture the population mean. You can also expect that 5% of the total will not capture the population mean. The confidence level refers to the long-term success rate of the method, or the estimation process based on random sampling. Pro tip: Remember that a 95% confidence level refers to the success rate of the estimation process. ~~~ Note: a confidence interval includes: -Sample statistic -Margin of error -Confidence level ~~~ Misinterpretation 1: "There is a 95% probability that the [population parameter] is between [L] and [U]." Remember that when we're constructing a confidence interval, we are estimating a population parameter when we only have data from a sample. We don't know if our sample statistic is less than, greater than, or approximately equal to the population parameter. And we don't know for sure whether our particular confidence interval contains the population parameter or not; the 95% refers to the long-run success rate of the method, not to any single interval. Misinterpretation 2: 95% refers to the percentage of data values that fall within the interval. A 95% confidence interval shows a range of values that likely includes the actual population mean. This is not the same as a range that contains 95% of the data values in the population. Example: Correlation Between Height and Weight At the beginning of the Spring 2017 semester, a sample of World Campus students were surveyed and asked for their height and weight. In the sample, Pearson's r = 0.487. A 95% confidence interval was computed of [0.410, 0.559]. Interpretation: The correct interpretation of this confidence interval is that we are 95% confident that the correlation between height and weight in the population of all World Campus students is between 0.410 and 0.559.
Example: Seatbelt Usage A sample of 12th grade females was surveyed about their seatbelt usage. A 95% confidence interval for the proportion of all 12th grade females who always wear their seatbelt was computed to be [0.612, 0.668]. Interpretation: The correct interpretation of this confidence interval is that we are 95% confident that the proportion of all 12th grade females who always wear their seatbelt in the population is between 0.612 and 0.668.
158
if you're working with a small sample size, and your data is approximately normally distributed, you should use the...
if you're working with a small sample size, and your data is approximately normally distributed, you should use the t-distribution rather than the standard normal distribution. For a t-distribution, you use t-scores to make calculations about your data. The graph of the t-distribution has a bell shape that is similar to the standard normal distribution. But, the t-distribution has bigger tails than the standard normal distribution does. The bigger tails indicate the higher frequency of outliers that come with a small dataset. As the sample size increases, the t-distribution approaches the normal distribution. When the sample size reaches 30, the distributions are practically the same, and you can use the normal distribution for your calculations.
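A t-based confidence interval for a small sample can be sketched as follows (the data are hypothetical; 2.262 is the t critical value for 95% confidence with 9 degrees of freedom):

```python
import math
import statistics

# Small sample (n = 10) of roughly normal data, e.g., penguin weights in lbs
weights = [2.9, 3.1, 3.0, 3.3, 2.8, 3.2, 3.0, 2.7, 3.1, 3.4]

n = len(weights)
mean = statistics.mean(weights)
se = statistics.stdev(weights) / math.sqrt(n)  # standard error: s / sqrt(n)

t_crit = 2.262  # t critical value, 95% confidence, df = n - 1 = 9

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```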
159
Hypothesis Testing
A statistical procedure that uses sample data to evaluate an assumption about a population parameter
160
Statistical Significance
The claim that the results of a test or experiment are not explainable by chance alone If a result is statistically significant, that means it’s unlikely to be explained solely by chance or random factors.
161
Steps for performing a hypothesis test
1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value 4. Reject or fail to reject the null hypothesis Example: You want to determine if a coin is fair (i.e., confirm that it is not weighted to favor one side). You'll flip the coin 6 times and record the results. 1. Null Hypothesis: The coin is fair. Alternative Hypothesis: The coin is not fair. 2. Significance level = 5% 3. P-value = 1.56%: the probability of a fair coin landing on tails 6 times in a row is 0.50 * 0.50 * 0.50 * 0.50 * 0.50 * 0.50 = 0.0156, or 1.56%, because each flip of a fair coin has a 50% chance of landing on tails. Because the p-value (1.56%) is less than the significance level (5%), the result would be unlikely if the coin were fair, which is evidence for the alternative hypothesis (i.e., the coin is not fair). 4. Decide whether to reject or fail to reject the null hypothesis (statisticians never say "accept," just "fail to reject," because "accept" would suggest a certainty you never have with probability). If p-value < significance level: reject the null hypothesis. If p-value > significance level: fail to reject the null hypothesis. Here, 1.56% < 5%, so you reject the null hypothesis.
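Step 3 of the coin example is a one-line calculation:

```python
# Probability of 6 tails in a row from a fair coin: 0.5^6
p_value = 0.5 ** 6
alpha = 0.05  # significance level

print(round(p_value, 4))  # 0.0156
print(p_value < alpha)    # True -> reject the null hypothesis "the coin is fair"
```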
162
Null Hypothesis
A statement that is assumed to be true unless there is convincing evidence to the contrary The null hypothesis typically assumes that the observed data occurs by chance. Notes: -The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample. -In statistics, the null hypothesis is often abbreviated as H sub zero (H0). -Null hypotheses often include phrases such as “no effect,” “no difference,” “no relationship,” or “no change.” -When written in mathematical terms, the null hypothesis always includes an equality symbol (usually =, but sometimes ≤ or ≥). *why? The null hypothesis represents the "boring" default assumption — that nothing interesting is happening. For example: "this drug has no effect," or "these two groups are identical." Equality symbols fit perfectly here because you're essentially saying things are the same, unchanged, or at baseline. Rule of thumb: Typically, the null hypothesis represents the status quo, or the current state of things. The null hypothesis assumes that the status quo hasn’t changed. Example#1: Mean weight An organic food company is famous for their granola. The company claims each bag they produce contains 300 grams of granola—no more and no less. To test this claim, a quality control expert measures the weight of a random sample of 40 bags. H0: μ = 300 (the mean weight of all produced granola bags is equal to 300 grams) Ha: μ ≠ 300 (the mean weight of all produced granola bags is not equal to 300 grams) Note: μ (pronounced "mew") is just the symbol for the mean (average) of a population.
163
Alternative Hypothesis
A statement that contradicts the null hypothesis and is accepted as true only if there is convincing evidence for it The alternative hypothesis typically assumes that the observed data does not occur by chance. Notes: -The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample. -In statistics, the alternative hypothesis is often abbreviated as H sub a (Ha). Alternative hypotheses often include phrases such as “an effect,” “a difference,” “a relationship,” or “a change.” -When written in mathematical terms, the alternative hypothesis always includes an inequality symbol (usually ≠, but sometimes < or >). *why? The alternative hypothesis is what you're trying to find evidence for — that something is happening, that there's a difference, a change, an effect. You can't really put an equals sign on that, because you're saying things are not the same. Rule of Thumb: The alternative hypothesis does NOT assume that the status quo hasn't changed; instead, it suggests a new possibility or different explanation. Example#1: Mean weight An organic food company is famous for their granola. The company claims each bag they produce contains 300 grams of granola—no more and no less. To test this claim, a quality control expert measures the weight of a random sample of 40 bags. H0: μ = 300 (the mean weight of all produced granola bags is equal to 300 grams) Ha: μ ≠ 300 (the mean weight of all produced granola bags is not equal to 300 grams) Note: μ (pronounced "mew") is just the symbol for the mean (average) of a population.
164
Significance Level
Also known as alpha (α): the threshold at which you will consider a result statistically significant (i.e., if the p-value is less than the threshold, you reject the null hypothesis). By convention, data professionals set the significance level at 0.05, or 5%. Other common choices are 1% and 10%. You can adjust the significance level to meet the specific requirements of your analysis. A lower significance level means an effect has to be larger to be considered statistically significant. A significance level of 5% means you are willing to accept a 5% chance of rejecting a null hypothesis that is actually true.
165
P-Value
The p-value, or probability value, is the probability of observing results at least as extreme as those observed, assuming the null hypothesis is true. It tells you the statistical significance of a finding. Example: Null Hypothesis: The coin is fair. Alternative Hypothesis: The coin is not fair. P-value = 1.56%: the probability of a fair coin landing on tails 6 times in a row is 0.50 * 0.50 * 0.50 * 0.50 * 0.50 * 0.50 = 0.0156, or 1.56%, because each flip of a fair coin has a 50% chance of landing on tails. Because this p-value is less than the 5% significance level, you reject the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis (i.e., the stronger the evidence that the coin is not fair). ** Rule of Thumb: A low p-value indicates high statistical significance (meaning you reject the null hypothesis), while a high p-value indicates low or no statistical significance (meaning you fail to reject the null hypothesis). **
166
Types of Errors in Hypothesis Testing
Type 1 Error (False Positive): The rejection of a null hypothesis that is actually true. (That is, you conclude that your result is statistically significant when it actually occurred by chance.) To reduce your chance of a type 1 error, choose a lower significance level. (A significance level of 5% means you are willing to accept a 5% chance of rejecting a null hypothesis that is actually true.) Type 2 Error (False Negative): The failure to reject a null hypothesis that is actually false. (That is, you conclude that your result occurred by chance when it is actually statistically significant.) Note: Choosing a lower significance level makes a type 2 error more likely.
167
μ
μ (pronounced "mew") is just the symbol for the mean (average) of a population.
168
Example Problem: A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead of 3 times), their recovery period will be longer. Average recovery times for knee surgery patients is 8.2 weeks. What is the hypothesis written mathematically? What is the null hypothesis? What is the alternative hypothesis?
The hypothesis written mathematically (this is the alternative hypothesis): H1: μ > 8.2 Null hypothesis: H0: μ ≤ 8.2 (often stated simply as H0: μ = 8.2) Alternative hypothesis: H1: μ > 8.2 Notes: H0: null hypothesis (the 0 should be in subscript) H1: alternative hypothesis (the 1 should be in subscript) μ: the population mean (pronounced "mew")
169
Type 1 Error
Also known as a false positive. Reject the null hypothesis when it’s actually true. This means that you report that your findings are significant when they have occurred by chance. The probability of making a type 1 error is represented by your alpha level (α), the p-value below which you reject the null hypothesis. To reduce your chance of making a Type I error, choose a lower significance level. (But this can increase your risk of making a type II error) **Type I errors are like false alarms**
170
Type 2 Error
Also known as a false negative. Fail to reject the null hypothesis when it’s actually false. The probability of making a type II error is called beta (β), which is related to the power of the statistical test (power = 1 − β). You can reduce your risk of making a Type II error by ensuring your test has enough power. In data work, power is usually set at 0.80, or 80%. The higher the statistical power, the lower the probability of making a Type II error. **Type II errors are like missed opportunities**
171
Does a statistically significant result prove that a research hypothesis is correct?
No. For a research hypothesis to be proven correct, we would need 100% certainty. Because a p-value is based on probabilities, there is always a chance of making an incorrect conclusion about rejecting or failing to reject the null hypothesis (H0).
172
Risks associated with a type 1 error
Resource Allocation: Making a Type I error can lead to wastage of resources. If a business believes a new strategy is effective when it’s not (based on a Type I error), they might allocate significant financial and human resources toward that ineffective strategy. Unnecessary Interventions: In medical trials, a Type I error might lead to the belief that a new treatment is effective when it isn’t. As a result, patients might undergo unnecessary treatments, risking potential side effects without any benefit. Reputation and Credibility: For researchers, making repeated Type I errors can harm their professional reputation. If they frequently claim groundbreaking results that are later refuted, their credibility in the scientific community might diminish. Notes: Type 1 Error: Also known as a false positive. Reject the null hypothesis when it’s actually true. This means that you report that your findings are significant when they have occurred by chance.
173
Risks associated with a type II error
Missed Opportunities: A Type II error can lead to missed opportunities for improvement or innovation. For example, in education, if a more effective teaching method is overlooked because of a Type II error, students might miss out on a better learning experience. Potential Risks: In healthcare, a Type II error might mean overlooking a harmful side effect of a medication because the research didn’t detect its harmful impacts. As a result, patients might continue using a harmful treatment. Stagnation: In the business world, making a Type II error can result in continued investment in outdated or less efficient methods. This can lead to stagnation and the inability to compete effectively in the marketplace. Notes: Type II error: Also known as a false negative. Fail to reject the null hypothesis when it’s actually false The probability of making a type II error is called Beta (β), which is related to the power of the statistical test (power = 1- β).
174
One-Sample Test
Determines whether or not a population parameter like a mean or proportion is equal to a specific value
175
Two-Sample Test
Determines whether or not two population parameters, such as two means or two proportions, are equal to each other. Example: A/B testing
176
One-Sample Z-Test Assumptions
-The data is a random sample of a normally distributed population -The population standard deviation is known Note: z-score: A measure of how many standard deviations below or above the population mean a data point is Also called standard scores as they're based on the standard normal distribution, which has a mean of zero and standard deviation of 1.
177
Example One-Sample Test
You work at a chain restaurant that conducted a study on their food delivery turnaround. Typically the mean delivery time is 40 minutes, but the mean delivery time of the sample is 38 minutes. Null hypothesis: mean delivery time = 40 (which is to say: the difference reported above is due to chance or sampling variability). p-value: the probability of observing a difference of 2 minutes or greater if the null hypothesis is true. If the p-value is less than 5% (the chosen significance level), reject the null hypothesis. *Why? A small p-value means the null hypothesis has become very hard to believe. If the p-value drops below that 5% threshold, you've decided the data is too unlikely under the null to keep accepting it, so you reject it. The z-score in this example is -2.82. On the standard normal distribution, -2.82 (the observed test statistic) is far to the left. The p-value is 0.0023, or 0.23%. 0.0023 < significance level, so you reject the null hypothesis. As such, it's unlikely that the 2-minute delivery difference is due to chance, which means we can reject the null hypothesis in favor of the alternative hypothesis. As such, we should study what the delivery drivers in this sample are doing to deliver faster and train the drivers in other areas to follow suit. NOTE: As a data professional, you’ll almost always calculate the p-value on your computer, using a programming language like Python or other statistical software.
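The delivery example above can be sketched in Python. Note that the card gives only the two means and the resulting z-score; the population standard deviation (σ = 5 minutes) and sample size (n = 50) below are hypothetical values chosen so the test statistic comes out near the card's z ≈ -2.8:

```python
import math
from scipy.stats import norm

# Values from the card
pop_mean = 40      # typical mean delivery time (minutes)
sample_mean = 38   # observed sample mean

# Hypothetical values (not stated in the card)
sigma = 5          # assumed population standard deviation
n = 50             # assumed sample size

# Test statistic: z = (x̄ − μ) / (σ / √n)
z = (sample_mean - pop_mean) / (sigma / math.sqrt(n))

# Left-tailed p-value: P(Z <= z) under the standard normal distribution
p_value = norm.cdf(z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z ≈ -2.83, p ≈ 0.0023
```

Because 0.0023 is below the 5% significance level, you would reject the null hypothesis, matching the conclusion on the card.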
178
Test Statistic
A value that shows how closely your observed data matches the distribution expected under the null hypothesis. In other words: a test statistic indicates how closely your data match the null hypothesis. The following formula gives you a test statistic z based on your sample data: z = (x̄ − μ) / (σ / √n) z: z-score, a measure of how many standard deviations below or above the population mean a data point is x̄: (pronounced "x-bar") the sample mean μ: (pronounced "mew") the population mean σ: (pronounced "sigma") the population standard deviation n: the sample size Note: The p-value is calculated from the test statistic
179
How do you interpret p-value?
If p-value < significance level: reject the null hypothesis If p-value > significance level: fail to reject the null hypothesis (can't say "accept" as that would imply absolute certainty that the null hypothesis is correct, and you never have 100% certainty with statistics.)
180
Every hypothesis test features:
-A test statistic that indicates how closely your data match the null hypothesis. For a z-test, your test statistic is a z-score; for a t-test, it’s a t-score. -A corresponding p-value that tells you the probability of obtaining a result at least as extreme as the observed result if the null hypothesis is true.
181
How do you calculate p-value?
As a data professional, you’ll almost always calculate p-value on your computer, using a programming language like Python or other statistical software.
182
two main rules for drawing a conclusion about a hypothesis test:
-If your p-value is less than your significance level, you reject the null hypothesis. -If your p-value is greater than your significance level, you fail to reject the null hypothesis.
183
Data professionals and statisticians always say “fail to reject” rather than “accept.” Why?
This is because hypothesis tests are based on probability, not certainty—acceptance implies certainty. In general, data professionals avoid claiming certainty about results based on statistical methods.
184
null hypothesis significance testing or hypothesis testing
In quantitative research, data are analyzed through null hypothesis significance testing, or hypothesis testing. This is a formal procedure for assessing whether a relationship between variables or a difference between groups is statistically significant. Null and alternative hypotheses To begin, research predictions are rephrased into two main hypotheses: the null and alternative hypothesis. A null hypothesis (H0) always predicts no true effect, no relationship between variables, or no difference between groups. An alternative hypothesis (Ha or H1) states your main prediction of a true effect, a relationship between variables, or a difference between groups. Hypothesis testing always starts with the assumption that the null hypothesis is true. Using this procedure, you can assess the likelihood (probability) of obtaining your results under this assumption. Based on the outcome of the test, you can reject or retain the null hypothesis.
185
Problems with relying on statistical significance
-Researchers classify results as statistically significant or non-significant using a conventional threshold that lacks any theoretical or practical basis (5% significance level is most common; other common significance levels are 1% and 10%.). This means that even a tiny 0.001 decrease in a p value can convert a research finding from statistically non-significant to significant with almost no real change in the effect. -On its own, statistical significance may also be misleading because it’s affected by sample size. In extremely large samples, you’re more likely to obtain statistically significant results, even if the effect is actually small or negligible in the real world. This means that small effects are often exaggerated if they meet the significance threshold, while interesting results are ignored when they fall short of meeting the threshold. -The strong emphasis on statistical significance has led to a serious publication bias and replication crisis in the social sciences and medicine over the last few decades. Results are usually only published in academic journals if they show statistically significant results—but statistically significant results often can’t be reproduced in high quality replication studies. As a result, many scientists call for retiring statistical significance as a decision-making tool in favor of more nuanced approaches to interpreting results. That’s why APA guidelines advise reporting not only p values but also effect sizes and confidence intervals wherever possible to show the real world implications of a research outcome. Notes: Effect Sizes: Effect size tells you how meaningful the relationship between variables or the difference between groups is. It indicates the practical significance of a research outcome. Confidence Intervals: the range of values that you expect your estimate to fall between a certain percentage of the time if you run your experiment again or re-sample the population in the same way.
186
Effect Sizes
Effect size tells you how meaningful the relationship between variables or the difference between groups is. It indicates the practical significance of a research outcome. While statistical significance shows that an effect exists in a study, practical significance shows that the effect is large enough to be meaningful in the real world. Statistical significance is denoted by p-values, whereas practical significance is represented by effect sizes. Example: Statistical significance vs practical significance A large study compared two weight loss methods with 13,000 participants in a control intervention group and 13,000 participants in an experimental intervention group. The control group used scientifically backed methods for weight loss, while the experimental group used a new app-based method. After six months, the mean weight loss (kg) for the experimental intervention group (M = 10.6, SD = 6.7) was marginally higher than the mean weight loss for the control intervention group (M = 10.5, SD = 6.8). These results were statistically significant (p = .01). However, a difference of only 0.1 kilo between the groups is negligible and doesn’t really tell you that one method should be favored over the other. Adding a measure of practical significance would show how promising this new intervention is relative to existing interventions. How do you calculate effect size? There are dozens of measures for effect sizes, but the most common are Cohen's d and Pearson's r. Cohen's d: Cohen’s d is designed for comparing two groups. It takes the difference between two means and expresses it in standard deviation units. It tells you how many standard deviations lie between the two means. In general, the greater the Cohen’s d, the larger the effect size. Pearson's r: Pearson’s r, or the correlation coefficient, measures the extent of a linear relationship between two variables.
The formula is rather complex, so it’s best to use statistical software to calculate Pearson’s r accurately from the raw data. For Pearson’s r, the closer the value is to 0, the smaller the effect size.
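A minimal sketch of Cohen's d for two groups, using the pooled standard deviation (one common variant of the formula); the group data below are made up for illustration:

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference between two means in pooled-standard-deviation units."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Sample variances (dividing by n - 1)
    var1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    # Pooled standard deviation across both groups
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Made-up example data: weight loss (kg) in two small groups
treatment = [11.0, 10.7, 11.3, 10.9, 10.6]
control = [10.1, 9.8, 10.4, 10.0, 9.7]

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

A common rule of thumb reads d ≈ 0.2 as a small effect, 0.5 as medium, and 0.8 or more as large.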
187
Two-Sample T-Test for Means, assumptions:
-The two samples are independent of each other -For each sample, the data is drawn randomly from a normally distributed population -The population standard deviation is unknown
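As a sketch, SciPy's `ttest_ind` runs a two-sample t-test directly; the sample data below are made up, and `equal_var=False` selects Welch's variant, which does not assume equal variances:

```python
from scipy import stats

# Made-up samples from two independent groups
group_a = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]
group_b = [11.2, 11.5, 11.0, 11.4, 11.1, 11.3]

# Two-sample t-test (Welch's version)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```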
188
When do data professionals use a z-test vs t-test?
Data professionals typically use -a z-test when the population standard deviation is known -a t-test when the population standard deviation is unknown and needs to be estimated from the data. Note: In practice, the population standard deviation is usually unknown, because it is difficult to get complete data on large populations
189
t-score
The test statistic for a t-test. t-scores are based on the t-distribution (as opposed to z-scores, which are based on the standard normal distribution). Notes: As the sample size increases, the t-distribution approaches the normal distribution. (Use a t-distribution when the sample is small, that is, n < 30.)
190
two-sample hypothesis test vs one-sample test
two-sample hypothesis test (or two-sample test): determines whether two population parameters, such as two means, are equal one-sample hypothesis test (or one-sample test): determines whether a population parameter is equal to a specific value.
191
Recall that p-value is the probability of observing results as or more extreme than those observed when the null hypothesis is true. In the context of hypothesis testing, “extreme” means
extreme in the direction(s) of the alternative hypothesis
192
one-tailed test
A one-tailed test results when the alternative hypothesis states that the actual value of a population parameter is either less than or greater than the value in the null hypothesis. A one-tailed test may be either left-tailed or right-tailed. A left-tailed test results when the alternative hypothesis states that the actual value of the parameter is less than the value in the null hypothesis. A right-tailed test results when the alternative hypothesis states that the actual value of the parameter is greater than the value in the null hypothesis. For example, imagine a test in which the null hypothesis states that the mean weight of a penguin population equals 30 lbs. In a left-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is less than (“<“) 30 lbs. In a right-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is greater than (“>”) 30 lbs.
193
two-tailed test
A two-tailed test results when the alternative hypothesis states that the actual value of the parameter does not equal the value in the null hypothesis. For example, imagine a test in which the null hypothesis states that the mean weight of a penguin population equals 30 lbs. In a two-tailed test, the alternative hypothesis might state that the mean weight of the penguin population is not equal (“≠”) to 30 lbs. REMEMBER: p-value is the probability of observing results as or more extreme than those observed when the null hypothesis is true. In the context of hypothesis testing, “extreme” means extreme in the direction(s) of the alternative hypothesis (e.g., a z-score that is less than -1.75 or greater than 1.75).
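To illustrate the difference, here is a quick sketch computing one-tailed and two-tailed p-values from the same z-score of 1.75 mentioned above:

```python
from scipy.stats import norm

z = 1.75

# Right-tailed test: P(Z >= z), the area in the upper tail only
p_one_tailed = norm.sf(z)

# Two-tailed test: P(|Z| >= z), the area in both tails combined
p_two_tailed = 2 * norm.sf(abs(z))

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```

For the same test statistic, the two-tailed p-value is always twice the one-tailed p-value, which is why a one-tailed test has more power to detect an effect in its chosen direction.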
194
H0 & Ha
H0: Null hypothesis Ha: Alternative hypothesis Note: The 0 and a should be subscripted
195
One-tailed versus two-tailed
You can use one-tailed and two-tailed tests to examine different effects. In general, a one-tailed test may provide more power to detect an effect in a single direction. However, before conducting a one-tailed test, you should consider the consequences of missing an effect in the other direction. For example, imagine a pharmaceutical company develops a new medication they believe is more effective than an existing medication. As a data professional analyzing the results of the clinical trial, you may wish to choose a one-tailed test to maximize your ability to detect the improvement. In doing so, you fail to test for the possibility that the new medication is less effective than the existing medication. And, of course, the company doesn’t want to release a less effective medication to the public. A one-tailed test may be appropriate if the negative consequences of missing an effect in the untested direction are minimal. For example, imagine that the company develops a new, less expensive medication that they believe is at least as effective as the existing medication. The lower price gives the new medication an advantage in the market. So, they just want to make sure the new medication is not less effective than the existing medication. Testing whether it’s more effective is not a priority. In this case, a one-tailed test may be appropriate.
196
Do you use a z-test or t-test to compare proportions?
For technical reasons, the best course is to use a z-test. Examples: Use a two-sample z-test to compare the proportion of -defects among manufactured products on two assembly lines -side effects to a new medicine for two trial groups -support for a new law among voters in two districts Example: A company has an office in Beijing and London. HR wants to determine if there is a different level of employee satisfaction at the two offices. -The team surveys a random selection of 50 employees at each office to see if they are satisfied with their current job. -Your goal: Determine if there is a statistically significant difference in the proportion of satisfied employees in London vs Beijing. If so, the HR team will deploy resources to investigate why employees at one office are more satisfied. -Results: London: 67% satisfied Beijing: 57% satisfied So, you conduct a two-sample z-test to analyze the data: 1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value (first find the test statistic, either the z-value or t-value, depending on the test type, then you can calculate the p-value) 4. Reject or fail to reject the null hypothesis 1. Null Hypothesis: There is no difference in the proportion of satisfied employees in London and Beijing. Alternative Hypothesis: There is a difference in the proportion of satisfied employees in London and Beijing. 2. Significance Level = 5% (company's standard for employee surveys) 3.
If p-value < 5%: reject the null hypothesis If p-value > 5%: fail to reject the null hypothesis (calculate the p-value on your computer using Python or a similar programming language) To find the p-value, you first need to find your test statistic z: z = (p̂₁ − p̂₂) / sqrt( p̂₀(1 − p̂₀)(1/n₁ + 1/n₂) ) p̂₁: sample proportion for the first group p̂₂: sample proportion for the second group n₁: sample size for the first group n₂: sample size for the second group p̂₀: the pooled proportion, a weighted average of the two sample proportions: p̂₀ = (p̂₁n₁ + p̂₂n₂) / (n₁ + n₂), which here gives 0.62. z = (0.67 − 0.57) / sqrt( 0.62 × (1 − 0.62) × (1/50 + 1/50) ) z ≈ 1.03 p-value (calculated using Python): 30.3% Remember: If p-value < significance level: reject the null hypothesis If p-value > significance level: fail to reject the null hypothesis 30.3% > 5% So, fail to reject the null hypothesis. There is not a statistically significant difference in the proportion of satisfied employees in the London vs Beijing office. In other words, the observed difference in proportions is likely due to chance. Value Add: Your results will likely save the HR team a great deal of time and money. The HR team will not have to dedicate money to investigating reasons for the difference in the two offices. Note: ^ is pronounced hat (e.g., p̂₁ is "p one hat"). The hat (^) over a letter means it's an estimate calculated from your sample data, as opposed to the true population value. p would mean the true proportion of the whole population, which you usually can't know. p̂ means the proportion you measured from your sample: your best guess at the true p.
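The Beijing/London example can be sketched as follows. The proportions and sample sizes come from the card; the pooled-proportion step is the standard formula:

```python
import math
from scipy.stats import norm

# Survey results from the card
p1, n1 = 0.67, 50   # London: proportion satisfied, sample size
p2, n2 = 0.57, 50   # Beijing: proportion satisfied, sample size

# Pooled proportion: weighted average of the two sample proportions
p0 = (p1 * n1 + p2 * n2) / (n1 + n2)

# Test statistic: difference in proportions over its standard error
z = (p1 - p2) / math.sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))

# Two-tailed p-value (the alternative hypothesis is "a difference in either direction")
p_value = 2 * norm.sf(abs(z))

print(f"z = {z:.2f}, p-value = {p_value:.3f}")  # z ≈ 1.03, p ≈ 0.303
```

Since 30.3% is well above the 5% significance level, you fail to reject the null hypothesis, matching the conclusion on the card.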
197
p̂₁ What does ^ mean?
^ is pronounced hat (e.g., p̂₁ is p one hat) The hat (^) over a letter means it's an estimate calculated from your sample data, as opposed to the true population value. p would mean the true proportion of the whole population — which you usually can't know p̂ means the proportion you measured from your sample — your best guess at the true p
198
Steps in a z-test or t-test
1. State the null hypothesis and the alternative hypothesis 2. Choose a significance level 3. Find the p-value (first find the test statistic, either the z-value or t-value, depending on the test type, then you can calculate the p-value) 4. Reject or fail to reject the null hypothesis
199
Common metrics analyzed in A/B tests
Average revenue per user: How much revenue does a user generate for a website? Average session duration: How long does a user remain on a website? Click rate: If a user is shown an ad, does the user click on it? Conversion rate: If a user is shown an ad, will that user convert into a customer?
200
Average revenue per user
How much revenue does a user generate for a website?
201
Average session duration
How long does a user remain on a website?
202
Click rate
Click rate: If a user is shown an ad, does the user click on it?
203
Conversion rate
Conversion rate: If a user is shown an ad, will that user convert into a customer?
204
3 Main features of a typical A/B test
1. Test design (e.g., A/B test) 2. Sampling (e.g., random selection) 3. Hypothesis testing (e.g., z-test or t-test)
205
randomized controlled experiment
In a randomized controlled experiment, test subjects are randomly assigned to a control group and a treatment group. (An A/B test is a basic version of what’s known as a randomized controlled experiment. ) The treatment is the new change being tested in the experiment. The control group is not exposed to the treatment. The treatment group is exposed to the treatment.
206
Experimental design
Experimental design refers to planning an experiment in order to collect data to answer your research question. For example, a data professional might design an experiment to discover whether: A new medicine leads to faster recovery time A new website design increases product sales A new fertilizer increases crop growth A new training program improves athletic performance
207
3 Key steps in designing an experiment
1. Define your variables (i.e., the independent and dependent variables) --independent: what you're interested in investigating (e.g., the medicine) --dependent: the effect you're interested in measuring (e.g., the recovery time) 2. Formulate your hypothesis 3. Assign test subjects to treatment and control groups
208
3 Key steps in designing an experiment: 1. Define your variables
Define the independent and dependent variables in the experiment -The independent variable refers to the cause you’re interested in investigating. A researcher changes or controls the independent variable to determine how it affects the dependent variable. “Independent” means it’s not influenced by other variables in the experiment. -The dependent variable refers to the effect you’re interested in measuring. “Dependent” means its value is influenced by the independent variable. Example: In your clinical trial, you want to find out how the medicine affects recovery time. Therefore: Your independent variable is the medicine—the cause you want to investigate. Your dependent variable is recovery time—the effect you want to measure.
209
3 Key steps in designing an experiment: 2. Formulate your hypothesis
Formulate the null & alternative hypothesis. Example: Your null hypothesis (H0) is that the medicine has no effect. Your alternative hypothesis (Ha) is that the medicine is effective.
210
3 Key steps in designing an experiment: 3. Assign test subjects to treatment and control groups:
Experiments such as clinical trials and A/B tests are controlled experiments. In a controlled experiment, test subjects are assigned to a treatment group and a control group. The treatment is the new change being tested in the experiment. The treatment group is exposed to the treatment. The control group is not exposed to the treatment. The difference in metric values between the two groups measures the treatment’s effect on the test subjects.
211
controlled experiment
In a controlled experiment, test subjects are assigned to a treatment group and a control group. The treatment is the new change being tested in the experiment. The treatment group is exposed to the treatment. The control group is not exposed to the treatment. The difference in metric values between the two groups measures the treatment’s effect on the test subjects.
212
nuisance factors
factors that can affect the result of an experiment, but are not of primary interest to the researcher.
213
Blocking
Blocking: arranging test subjects in groups, or blocks, that are similar to one another.
214
Regression Analysis or Regression Models
A group of statistical techniques that use existing data to estimate the relationships between a single dependent variable and one or more independent variables
215
PACE for regression analysis
Plan, Analyze, Construct, & Execute Plan: Understand your data in the problem context. Contextualize and understand the data and the problem. Analyze: EDA (exploratory data analysis), check model assumptions, & select model. Determine if we should move forward with building the model. Construct: Construct and evaluate the model. Determine how well your model fits the data. Execute: Interpret the model and share the story. Descriptions must take into account the context of the data.
216
Model Assumptions
Statements about the data that must be true to justify the use of particular data science techniques
217
Linear Regression
A technique that estimates the linear relationship between a continuous dependent variable and one or more independent variables. Example: the relationship between the price of a product (x value) and the number of sales (y value)
218
Dependent Variable (Y)
The variable a given model estimates, which is also referred to as a response or outcome variable.
219
Independent Variable (x)
A variable that explains trends in the dependent variable, which is also referred to as an explanatory or predictor variable.
220