Statistics_4E_Freedman Flashcards

Freedman, Pisani, Purves - 2007 (107 cards)

1
Q

Controlled Trial / Experiment

A

A study where investigators apply a treatment to a group of subjects and compare the outcome to a control group that receives no treatment or a placebo.

2
Q

Method of Comparison

A

The fundamental principle of establishing causation in statistical studies. It involves comparing the outcomes of two (or more) groups of subjects: the treatment group (which receives the intervention being tested) and the control group (which serves as the baseline, often receiving a placebo or standard care).

3
Q

Treatment Group

A

The group of subjects in a study that receives the intervention being tested (e.g., a new drug, a specific diet, an educational program, or the Salk vaccine). The results for this group are compared to the control group to measure the effect of the intervention.

4
Q

Control Group

A

The group of subjects in a study that is used as a baseline for comparison. They are treated identically to the treatment group in every way, except they receive a placebo (or standard treatment) instead of the actual intervention being tested.

5
Q

Random Assignment

A

The use of an impersonal chance procedure (like flipping a coin or drawing names) to divide subjects into treatment and control groups. Its purpose is to ensure, on average, that the groups are balanced and equivalent with respect to all confounding factors, both known and unknown.

6
Q

Double-Blind Study

A

A controlled experiment in which neither the subjects receiving the treatment nor the researchers/diagnosticians evaluating the outcome know who is in the treatment group and who is in the control group. This design is used to prevent the placebo effect in subjects and diagnostic bias in observers.

Without this blinding of the evaluators, diagnostic bias can creep into the results.

7
Q

Confounding Variable

A

A variable that is associated with both the treatment being studied and the response/outcome. Because the treatment and the confounder are mixed up, it is impossible to determine whether the observed effect is due to the treatment or the confounder. This is the main reason why observational studies can only establish association, not causation.

8
Q

Randomized Controlled Trial/Experiment (RCT)

A

Gold standard of study design, where investigators use an impartial chance procedure to assign subjects to treatment or control groups.

9
Q

Placebo

A

Neutral, inactive treatment given to the control group in an experiment, designed to resemble the actual treatment. It’s used to blind subjects and control for the placebo effect, which is the psychological tendency for subjects to show an effect simply because they believe they are receiving a treatment.

10
Q

Eligible/Target Population

A

Entire group of individuals that a study intends to describe or draw conclusions about

11
Q

Sample Population

A

Smaller group drawn from the larger eligible/target population, on which the study is actually performed.

12
Q

Historical Controls

A

Historical controls refer to a type of study design where the control group is not selected and run concurrently with the treatment group, but rather consists of subjects from a previous study or patients whose outcomes are known from the past.

This design is often considered weaker than a Randomized Controlled Experiment because there’s a huge risk of confounding—the differences between the treatment group and the historical control group (like changes in medical care, diagnostic standards, or population characteristics over time) can easily bias the results.

13
Q

Contemporaneous Controls

A

Contemporaneous controls refer to subjects in an experiment who are treated exactly the same as the treatment group, except for the intervention being studied, and are followed over the same period of time. They are essential for a good Randomized Controlled Experiment (RCE) because they eliminate the bias inherent in Historical Controls (like changes in the population, environment, or medical care).

14
Q

Response

A

Response is the formal term for the outcome that is measured in a study or experiment.

In statistics, the response variable (or dependent variable) is the characteristic that the investigator is interested in measuring or comparing to see if it changes when a factor (the treatment or explanatory variable) is applied.

For example, if a study tests a new fertilizer, the treatment is the fertilizer, and the response might be the plant’s height or the size of the yield.

15
Q

Q: Difference between placebo and control group?

A

The key distinction is that the control group is a set of participants, while the placebo is a type of treatment they may receive.

In short, the Control Group is the set of subjects who receive the Placebo (or sometimes standard care, or no treatment at all). The control group ensures that any observed effect is due to the actual treatment and not to other factors.

16
Q

Q: What problem(s) arise when the treatment and control groups are NOT created using random assignment?

A

The primary problem that arises when treatment and control groups are NOT created using random assignment is confounding (or selection bias).

Confounding and Selection Bias

  • Confounding: Without random assignment, there is a risk that the treatment and control groups differ systematically in ways (called confounding factors or lurking variables) that are related to the outcome.
  • The Effect: This makes it impossible to confidently conclude that any observed difference in the outcome is due to the treatment itself. The difference might be explained by the pre-existing differences between the groups.

Example: If a researcher non-randomly assigns healthier, younger participants to the “Treatment” group and older, less healthy participants to the “Control” group, and the treatment group has better outcomes, the researcher cannot distinguish whether the improvement was due to the treatment or the participants’ naturally better health/age.

17
Q

Q: What are the defining characteristics of a well-designed randomized controlled trial (RCT)?

A
  1. Random Assignment (to ensure comparability and minimize confounding).
  2. Control Group (to provide a baseline for comparison).
  3. Placebo/Blinding (to account for the placebo effect and experimenter bias).
  4. Intentional Manipulation (the investigator assigns the treatment to establish causation). This is the crucial distinction from an observational study.
18
Q

Q: Why is it important for the treatment and control groups to be comparable, and how does randomization achieve this?

A

Importance of Comparability: Comparability is vital because it ensures that the only systematic difference between the groups is the treatment itself. If the groups are not comparable, any difference in outcome could be due to a confounding variable (e.g., age, health, lifestyle) instead of the treatment, making it impossible to establish causation.

Role of Randomization: Randomization (random assignment) achieves comparability by acting like a fair chance mechanism. It ensures that, on average, all known and unknown confounding variables are distributed roughly equally between the treatment and control groups, thus eliminating selection bias and making the groups statistically equivalent.

19
Q

Observational Study

A

Definition: A study where the researcher observes and measures subjects and variables without intervention or manipulation of treatment.

Key Distinction: The investigator does not assign treatments; subjects self-select into groups (e.g., people who choose to exercise vs. those who don’t).

Main Limitation: It cannot establish causation (cause-and-effect) due to the high risk of confounding variables (lurking factors that differ between the groups). It can only show association or correlation.

When Used: When a Randomized Controlled Trial (RCT) is unethical (e.g., studying harmful exposures) or impractical (e.g., studying a rare trait or long-term phenomenon).

20
Q

Q: Difference between a Controlled Experiment and an Observational Study?

A

In a Controlled Experiment, the investigator actively intervenes by randomly assigning subjects to treatment and control groups. This control allows for establishing causation (cause-and-effect).

In an Observational Study, the investigator is passive, simply observing subjects who have self-selected into groups. This can only show association (correlation), not causation, due to the risk of confounding variables.

21
Q

Correlation / Association

A

Association (or correlation) describes a relationship between two or more variables, meaning that certain values of one variable tend to occur with certain values of another.

Definition: Variables are said to be associated if knowing the value of one variable gives you information about the likely value of the other.

The Crucial Limit: Finding an association does not prove causation. Just because two things are related doesn’t mean one causes the other. The relationship might be due to a confounding variable.

Example: There is an association between carrying a lighter and getting lung cancer, but carrying a lighter doesn’t cause cancer; smoking is the confounding variable that causes both.

22
Q

Q: Difference between association and causation?

A

The distinction between association and causation is based on whether one variable is proven to cause the other.

Association (Correlation): This means two variables tend to occur together or change together. An association can be shown by both observational studies and controlled experiments. However, association does not imply causation; the relationship might be due to a third, lurking variable (a confounder).

Causation (Cause-and-Effect): This means a change in one variable is directly responsible for a change in the other. Causation can only be strongly established by a well-designed Randomized Controlled Trial (RCT), where randomization balances out other potential factors.

23
Q

Stratification / Cross-Tabulation

A

Definition: A statistical technique used primarily in observational studies (and sometimes in experiments) to divide the sample into smaller, homogeneous sub-groups called strata based on a potential confounding variable.

Purpose: To see if an association observed in the overall data holds true within each stratum. This helps to control for (or adjust for) the confounding variable.

How it Works: The data is broken down into a table (a cross-tabulation) where the effect of the treatment is examined for each level of the confounding variable (e.g., comparing groups by treatment status separately for young subjects and old subjects).

Outcome: If the association disappears after stratification, it suggests the original association was spurious (fake) and entirely due to the confounding variable. If the association persists within the strata, it strengthens the argument for a real link.

24
Q

Simpson’s Paradox

A

Definition: A phenomenon where an association or trend that appears in several different groups of data (strata) reverses or disappears when the groups are combined.

Cause: It is caused by a powerful, unaccounted-for confounding variable that is unequally distributed among the sub-groups.

Significance: It demonstrates the danger of combining data from incomparable groups in observational studies. When the data is stratified (broken down), the true relationship is often revealed.

Key Idea: The association you see in the overall (combined) table is the wrong conclusion; the association seen in the smaller, stratified tables is usually the correct one.
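A minimal numeric sketch of the paradox (the counts below are hypothetical, patterned after the well-known kidney-stone illustration): the treatment wins inside each severity stratum, yet loses in the combined table because severe cases are unevenly distributed.

```python
# Hypothetical (successes, total) per severity stratum.
data = {
    "treatment": {"mild": (81, 87),   "severe": (192, 263)},
    "control":   {"mild": (234, 270), "severe": (55, 80)},
}

for group, strata in data.items():
    wins = sum(s for s, n in strata.values())
    total = sum(n for s, n in strata.values())
    per_stratum = {k: f"{s/n:.0%}" for k, (s, n) in strata.items()}
    print(group, per_stratum, f"overall: {wins/total:.0%}")

# treatment {'mild': '93%', 'severe': '73%'} overall: 78%  <- better in each stratum
# control   {'mild': '87%', 'severe': '69%'} overall: 83%  <- better only combined
```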

25
Descriptive Statistics
Definition: Methods used to summarize and describe the main features of a dataset.
Purpose: To make data more manageable and easier to interpret.
Techniques: Includes calculating averages (mean, median), measuring spread (standard deviation, range), and creating visual summaries (histograms, scatter diagrams).
Contrast: It only describes the sample data; it does not draw conclusions or make inferences about a larger population.
26
Histogram
Definition: A graphical tool used to display the distribution of a quantitative variable (e.g., height, income, test scores).
Construction: The horizontal axis shows the possible values of the variable, divided into classes (intervals or bins). The vertical axis shows the frequency (or percentage) of observations that fall into each class.
Key Feature: The area of each rectangular bar is proportional to the number of cases in that class.
Purpose: To quickly visualize the shape of the distribution (e.g., symmetric, skewed), its center, and its spread.
27
Class Intervals
Definition: The ranges of values used to group quantitative data when constructing a histogram or frequency table.
Characteristics: They must cover all the data, be non-overlapping, and are typically of equal width (though not always).
Purpose: To simplify a wide variety of data values into a small, manageable number of categories so the distribution can be clearly visualized and summarized.
Freedman's Principle: The choice of interval width significantly affects the shape of the histogram; too wide, and detail is lost; too narrow, and the distribution looks jagged.
28
Q: When examining a histogram, what does the area of the block represent?
Percentage of total
29
Distribution/Frequency Table
Definition: A table that summarizes the distribution of a variable by listing the class intervals (or values) and the frequencies (or percentages) of the observations that fall into each interval.
Purpose: To clearly organize and present raw data in a way that shows how the data is spread out and where the majority of observations lie.
Relationship to Histogram: A distribution table is essentially the numerical source data used to create a histogram; the intervals are the bases of the bars, and the frequencies are the heights of the bars.
Also Known As: A frequency table (when using counts) or a relative frequency table (when using percentages).
30
Endpoint Convention
Definition: The rule used when defining class intervals for a histogram or distribution table to determine exactly where observations that fall on a boundary (or endpoint) should be counted.
The Standard Rule: In Freedman's text, the standard convention is that an observation falling exactly on a class boundary is counted in the interval to its right (the next higher interval).
Example: If the intervals are 10-20 and 20-30, an observation of 20 is counted in the 20-30 interval, not the 10-20 interval.
Purpose: To ensure that every observation is counted in one and only one class interval, making the frequency table and histogram accurate and unambiguous.
31
Density Scale
Definition: A vertical scale used in a histogram where the height of a bar is made equal to the percentage of cases in the interval divided by the width of the interval.
Formula: height = (percentage of cases in the interval) / (width of the interval)
Purpose: To make the histogram's visual interpretation accurate. Using the density scale ensures that the area of the bar, not just the height, represents the percentage of cases in that class interval.
When Used: It is essential when the class intervals have unequal widths. If the widths are equal, the area is proportional to the height, and a simple percentage or frequency scale is sufficient.
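A quick sketch of the density-scale arithmetic (the interval endpoints and percentages below are made up for illustration):

```python
# Density scale: bar height = percent of cases / interval width,
# so bar AREA (height x width) recovers the percent of cases.
classes = [(0, 10, 15.0), (10, 25, 30.0), (25, 50, 40.0), (50, 100, 15.0)]

for lo, hi, pct in classes:
    width = hi - lo
    height = pct / width  # percent per horizontal unit
    print(f"{lo}-{hi}: height = {height:.2f} percent per unit, area = {height * width:.0f}%")
```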
32
Q: In a histogram, what does the height of the block represent?
Percentage per horizontal unit (density)
33
Variable
Definition: A characteristic or attribute that can take on different values from one individual or unit of observation to another (e.g., height, income, gender, blood pressure).
Purpose: Variables are the basic building blocks of any statistical study, as researchers measure and analyze their distributions and relationships.
Types (as discussed in the text):
- Quantitative: Measured in numbers (e.g., age, weight).
- Qualitative (or Categorical): Sorted into distinct categories (e.g., occupation, blood type).
Key Idea: If an attribute were the same for every subject in a study (a constant), it would be useless for statistical analysis.
34
Q: What are the two types of variables
1. Qualitative
2. Quantitative
35
Q: What are the two types of quantitative variables?
1. Discrete
2. Continuous
36
Q: How do we summarize the center of a data set?
Average
37
Q: How do we measure the spread around the center of a data set?
Standard Deviation
38
Average
Definition: A single, representative value that describes the center or typical size of a set of numbers.
Intuitive Explanation: If you were to smooth out the data (taking from the high values and giving to the low values), the average is the height or amount everyone would end up with. It's the leveling point of the distribution.
Most Common Type (Arithmetic Mean): Calculated by summing all the values in a list and then dividing by the count of those values.
Purpose: It allows you to summarize a large dataset with a single, easily comparable number.
39
Standard Deviation
Definition: A measure of the spread or variability of a list of numbers, showing how much the values typically differ from the average (mean).
Intuitive Explanation: It's the typical distance between an individual data point and the center of the data. A small SD means the data points are tightly clustered around the average; a large SD means the points are widely scattered.
Key Property: It is the root-mean-square (RMS) of the deviations (differences) from the mean, which gives the typical size of those deviations.
Relationship to Variance: The variance is the SD squared. The SD is more commonly used because it is in the same units as the data (e.g., dollars, inches), unlike the variance.
40
Mean
Definition: The arithmetic average of a list of numbers, calculated by summing all the values and dividing by the total count of values.
Intuitive Explanation: The mean is the balance point of the data distribution. If you put all the values on a seesaw, the mean is the exact spot where the seesaw would balance. It's the fair share, or the amount each item would receive if the total were distributed equally.
Key Property: The sum of the deviations (differences) of all data points from the mean is always zero.
41
Interquartile Range
Definition: A measure of spread that quantifies the difference between the first quartile (Q1) and the third quartile (Q3).
Formula: IQR = Q3 - Q1
Intuitive Explanation: The IQR gives the range of the middle 50% of the data. It tells you how spread out the most central half of your values are, ignoring the extreme 25% on the low end and the 25% on the high end.
Key Advantage: Unlike the standard deviation, the IQR is robust (resistant) to outliers (extreme values), making it a reliable measure of spread for skewed distributions.
42
Median
Definition: The middle value in a list of numbers that has been arranged in numerical order. If there's an even count of numbers, it's the average of the two middle values.
Intuitive Explanation: The median is the value that splits the data exactly in half; 50% of the observations are below it, and 50% are above it. It's the point where you'd cut a distribution to have an equal number of cases on either side.
Key Advantage: It's robust (resistant) to outliers (extreme values) because its calculation only depends on the position of the data points, not their magnitude.
Use: The median is often preferred over the mean for data that is heavily skewed (like income or house prices).
43
Mode
Definition: The value or category that appears most frequently in a dataset or distribution.
Intuitive Explanation: It's the most popular choice or the most common result. If you looked at a histogram, the mode is the value under the tallest bar. A distribution can have one mode (unimodal), two modes (bimodal), or more.
Use: It is the only measure of center that can be used for qualitative (categorical) data (e.g., favorite color, brand preference), as you cannot calculate a mean or median for categories.
Limitation: It is easily affected by small changes in data and can sometimes be a poor representation of the center if the rest of the data is far away.
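A short sketch contrasting the three measures of center, using Python's standard statistics module on a made-up list; note how one outlier drags the mean but barely moves the median:

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7]
print(statistics.mean(data))    # ~4.29
print(statistics.median(data))  # 4
print(statistics.mode(data))    # 3 (the most frequent value)

# One extreme outlier: the mean jumps, the median barely moves.
with_outlier = data + [100]
print(statistics.mean(with_outlier))    # 16.25
print(statistics.median(with_outlier))  # 4.5
```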
44
Cross-Sectional Survey
Definition: A survey that measures a snapshot of a population at a single point in time. Data is collected on both the variables of interest and the outcome simultaneously.
Intuitive Explanation: It's like taking a photo of a group: you capture information right now, but you don't track how things change over time.
Study Type: It is a type of observational study.
Limitation: It is very difficult to establish a clear temporal order (which came first: the exposure or the outcome), making it weak evidence for causation. You can only confirm that an association exists at that moment in time.
45
Longitudinal Survey
Definition: A type of observational study where the same subjects are followed and measured repeatedly over a long period of time.
Intuitive Explanation: It's like taking a movie of a group: you track them over months or years, observing how variables and outcomes change as time progresses.
Key Advantage: Because the data is collected over time, it helps researchers establish temporal precedence (which came first, the exposure or the outcome), which strengthens the evidence for a possible causal link more than a single cross-sectional survey does.
Examples: Cohort studies (following a group forward in time) and panel studies.
Limitation: They are expensive, time-consuming, and suffer from attrition (subjects dropping out over the years), which can introduce bias.
46
Q: Why do we need to understand if the survey data is cross-sectional or longitudinal?
We need to understand if survey data is cross-sectional or longitudinal primarily to determine the strength of the evidence for a time-based relationship and to assess the study's limitations.
Establishing Cause and Effect (Causation): Longitudinal studies are better because they track subjects over time, helping to establish temporal precedence, i.e., that the exposure or cause actually happened before the outcome or effect. This is a critical requirement for suggesting causation. Cross-sectional studies measure everything at once, making it difficult or impossible to tell which variable came first.
Assessing Changes and Trends: Longitudinal data allows researchers to measure changes within the same individual over time, track trends, and determine the duration of an effect. Cross-sectional data can only show differences between individuals at one moment; it cannot show how any individual variable changes.
Evaluating Bias: Cross-sectional surveys are highly susceptible to confounding. Longitudinal surveys face challenges with attrition (subjects dropping out) and can be very expensive and time-consuming.
In short, knowing the survey type tells you what conclusions you can reasonably draw from the data, especially regarding the critical difference between mere association and a potential causal link.
47
Root-Mean-Square (RMS)
Definition: A general statistical calculation used to find the typical size or magnitude of a set of numbers, especially when those numbers include both positive and negative values (like deviations).
Intuitive Explanation: It's the average size of a list of numbers, ignoring their sign. It's the standard way to measure the typical size of deviations, giving larger weight to larger numbers.
Steps to Calculate:
1. Square (S): Square all the numbers in the list.
2. Mean (M): Find the average (mean) of those squares.
3. Root (R): Take the square root of the mean.
Key Application: The Standard Deviation (SD) is calculated as the RMS of the deviations from the mean.
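A minimal sketch of the Square-Mean-Root steps, showing that the SD is just the RMS of the deviations from the average (the data list is made up):

```python
import math

def rms(values):
    """Square each value, take the mean of the squares, then the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))

data = [1, 3, 4, 5, 7]
avg = sum(data) / len(data)           # 4.0
deviations = [x - avg for x in data]  # [-3.0, -1.0, 0.0, 1.0, 3.0]
print(rms(deviations))                # 2.0 -- this is the SD of the list
```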
48
Normal Distribution
Definition: A specific, symmetrical, bell-shaped distribution that frequently arises in natural phenomena and statistical theory. It is characterized entirely by its mean (μ) and its standard deviation (σ).
Intuitive Explanation: It's the ideal shape for many large, naturally occurring datasets (like heights, weights, or measurement errors). It shows that most observations cluster around the average (the center), and values become increasingly rare the further they deviate from the average.
Key Property (The 68-95-99.7 Rule): For any Normal Distribution:
- Roughly 68% of the data falls within 1 SD of the mean.
- Roughly 95% of the data falls within 2 SDs of the mean.
- Roughly 99.7% of the data falls within 3 SDs of the mean.
Significance: It serves as a crucial benchmark distribution in statistics, especially for understanding variation and performing inference.
49
Empirical Rule
Definition: A quick, practical rule that approximates the spread of data for any Normal Distribution using the standard deviation.
Intuitive Explanation: It tells you where the vast majority of your data will fall, as long as the distribution looks like a bell-shaped curve. It's a handy way to estimate percentages without complex calculations.
The Three Key Percentages:
- Approximately 68% of the data falls within 1 Standard Deviation (SD) of the mean.
- Approximately 95% of the data falls within 2 SDs of the mean.
- Approximately 99.7% of the data falls within 3 SDs of the mean.
Condition: The rule only applies to data that follows the Normal Curve.
50
Standard Units (or Z-score)
Definition: A standardized measure that shows how many standard deviations (SDs) a value is above or below the average (mean).
Intuitive Explanation: Standard units are a way to level the playing field and compare apples to oranges. By converting different types of measurements (like height and weight, or scores on different tests) to a common unit (the SD), you can easily see which value is relatively more extreme.
Formula: standard units = (value - average) / SD
Interpretation: A positive standard unit means the value is above the average; a negative standard unit means the value is below the average. A standard unit of 0 means the value is exactly the average.
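A small sketch of the formula; the test averages and SDs below are hypothetical:

```python
def standard_units(value, average, sd):
    return (value - average) / sd

# Hypothetical tests: math (average 500, SD 100) vs. verbal (average 22, SD 4).
print(standard_units(650, 500, 100))  # 1.5 -> 1.5 SDs above average
print(standard_units(30, 22, 4))      # 2.0 -> the relatively more extreme score
```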
51
Normal Approximation
Definition: The practice of using the Normal Distribution (Normal Curve) to estimate or approximate the percentages, probabilities, or counts within a given range for a different, non-normal dataset or distribution.
Intuition: Many real-world distributions, especially those built from chance processes like sums or averages, are very close to the Normal Curve even if the underlying data isn't perfectly normal. The Normal Approximation allows us to use the well-known 68-95-99.7 Rule and the Standard Unit calculations to quickly and effectively estimate percentages for these complex distributions.
Key Step: To use the Normal Approximation, you must first convert the data value(s) into Standard Units (Z-scores) using the mean and standard deviation of the dataset. You then use a Normal Curve table to find the desired percentage.
Applicability: It works best when the original dataset has a large number of observations and the histogram's shape is close to the smooth, symmetrical bell curve.
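A sketch of the normal approximation using only the standard library (math.erf gives the area under the standard normal curve); the height figures are hypothetical:

```python
import math

def normal_area_below(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: heights averaging 69 inches with an SD of 3 inches.
avg, sd = 69.0, 3.0
z_lo = (66.0 - avg) / sd   # -1.0 in standard units
z_hi = (72.0 - avg) / sd   # +1.0 in standard units
print(normal_area_below(z_hi) - normal_area_below(z_lo))  # ~0.68, i.e. ~68%
```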
52
Percentile
Definition: A value in a distribution such that a specific percentage of the data falls at or below that value.
Intuition: It tells you what percentage of people, scores, or measurements you outperformed or are equal to. For example, if you score in the 90th percentile, 90% of the scores are below yours, and 10% are above.
Special Cases: The median is the 50th percentile. The interquartile range (IQR) is defined by the 25th percentile (Q1) and the 75th percentile (Q3).
Use: Percentiles are particularly useful for understanding relative standing in skewed data where the mean and standard deviation may not be good measures of center and spread.
53
Percentile Rank
Definition: The percentage of scores or values in a distribution that are equal to or less than a particular value.
Intuition: The percentile rank tells you the relative standing of a specific score compared to the entire group. If your test score has a percentile rank of 80, it means you scored as well as or better than 80% of the people who took the test.
Contrast with Percentile: A percentile is a value (a score); a percentile rank is a percentage (the position of the score).
Use: It is primarily used to interpret individual scores by placing them in the context of a larger distribution.
54
Q: Difference between "percentile" and "percentile rank"?
The difference between "percentile" and "percentile rank" is what they measure: one is a value from the dataset, and the other is a percentage of the data.
Percentile:
- Definition: A value (a score, height, etc.) in a distribution such that a specified percentage of the data falls at or below that value.
- Intuition: It's the boundary line. For example, the 80th percentile is the specific income amount that separates the bottom 80% of earners from the top 20%.
- Output: A value in the original units of the data (e.g., 75 inches, a 550 test score, or $80,000).
Percentile Rank:
- Definition: The percentage of scores or values in a distribution that are equal to or less than a particular value.
- Intuition: It's the position. For example, if a test score of 550 has a percentile rank of 80, it means that 80% of all test-takers scored 550 or lower.
- Output: A percentage (a number between 0 and 100).
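A rough sketch of both ideas (these simple helpers use the nearest-rank convention; real libraries often interpolate differently):

```python
import math

def percentile(data, p):
    """Return the VALUE at or below which p percent of the data falls."""
    s = sorted(data)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def percentile_rank(data, value):
    """Return the PERCENTAGE of the data at or below the given value."""
    return 100 * sum(1 for x in data if x <= value) / len(data)

scores = [40, 55, 60, 65, 70, 75, 80, 85, 90, 95]
print(percentile(scores, 50))       # 70   -- a value, in score units
print(percentile_rank(scores, 80))  # 70.0 -- a percentage of test-takers
```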
55
Change of Scale
Definition: The process of adjusting a list of numbers by either adding a constant to every value (shifting the data) or multiplying by a constant (rescaling the data), or both.
Intuition: This describes what happens to the summary statistics (like the average and standard deviation) when you change the units of measurement. For instance, converting temperatures from Celsius to Fahrenheit involves both a multiplication and an addition.
Effect on Mean/Average:
- If you add/subtract a constant to every number, the mean changes by that exact amount.
- If you multiply/divide by a constant, the mean changes by that exact factor.
Effect on Standard Deviation (SD):
- Adding/subtracting a constant does not change the SD (because the spread between the numbers remains the same).
- Multiplying/dividing by a constant changes the SD by that factor (because the distances between numbers are also stretched or compressed).
Key Idea: The standard deviation measures spread, which is independent of the location of the mean.
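A quick sketch confirming the shift/scale rules with a made-up list (statistics.pstdev computes the SD the way the text does, dividing by n):

```python
import statistics

data = [10, 20, 30, 40, 50]
print(statistics.mean(data), statistics.pstdev(data))  # 30, ~14.14

shifted = [x + 100 for x in data]  # add a constant: mean shifts, SD unchanged
scaled = [x * 3 for x in data]     # multiply: both mean and SD scale by 3
print(statistics.mean(shifted), statistics.pstdev(shifted))  # 130, ~14.14
print(statistics.mean(scaled), statistics.pstdev(scaled))    # 90,  ~42.43
```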
56
Q: What happens to your MEAN/AVERAGE when you ADD a constant to every value in data set?
Mean increases/decreases by the constant amount
57
Q: What happens to your MEAN/AVERAGE when you MULTIPLY by a constant for every value in data set?
Mean is multiplied by the constant
58
Q: What happens to your STANDARD DEVIATION when you ADD a constant to every value in data set?
No change
59
Q: What happens to your STANDARD DEVIATION when you MULTIPLY by a constant for every value in data set?
SD is multiplied by the absolute value of the constant (the SD can never be negative)
60
Q: What happens to your CORRELATION COEFFICIENT, r, when you ADD a constant to every value in data set?
No effect
61
Q: What happens to your CORRELATION COEFFICIENT, r, when you MULTIPLY by a constant for every value in data set? Why?
No effect from multiplying by a positive constant: r is computed from standard units, and a positive change of scale leaves standard units unchanged. Multiplying by a negative constant flips the sign of r but leaves its magnitude unchanged.
62
Measurement Error
Definition: The inevitable chance variation that occurs when a quantity is measured, meaning the recorded value differs from the true value.
Formula: measured value = true value + bias + chance error
Intuition: No measurement tool is perfectly precise, and no human can use a tool perfectly. The error represents the combination of the instrument's limitation and human imperfection. It is assumed to be a random chance error.
Key Idea: Measurement errors are generally assumed to be independent and follow the Normal Distribution, clustering around zero. This means small errors are common, and large errors are rare, with no systematic tendency to be too high or too low.
63
Chance Error
Definition: The component of measurement error that is random and unpredictable. It causes a measurement to be sometimes too high and sometimes too low, with no systematic pattern.
Intuition: It represents the small, inevitable fluctuations that happen every time a measurement is taken, such as slight variations in reading a scale, environmental noise, or the inherent imprecision of the instrument. It is due to chance.
Key Property: Chance errors are assumed to average out to zero over many repeated measurements. They are also generally assumed to follow the Normal Distribution.
Contrast: It differs from bias (or systematic error), which is a consistent error that pushes the measurement in one direction (always too high or always too low).
Measuring It: We can measure the size of the chance error by taking the standard deviation of repeated measurements made under identical conditions.
64
Calibration
Definition: The process of checking and adjusting a measurement instrument to ensure its readings are accurate relative to a known standard.
Intuition: It's the act of zeroing or correcting a tool so that the measurements it provides are reliable and trustworthy. For example, before using a scale, you check that it reads zero when nothing is on it.
Purpose in Statistics: Proper calibration helps to eliminate or minimize systematic error (bias). If a scale is consistently reading 5 pounds too high, calibration is the step that fixes this bias, ensuring that any remaining error is only due to random chance.
Contrast: Calibration deals with systematic error (bias), while statistical analysis deals primarily with chance error.
65
Outlier
Definition: An observation (a value) that is far away from the rest of the data points in a distribution.
Intuition: It's the odd one out: a score, measurement, or data point that seems extreme or highly unusual compared to the bulk of the data.
Causes: Outliers can be caused by a simple error in measurement or recording, or they can be genuine, rare occurrences that correctly reflect an extreme case in the population.
Significance: Outliers can have a disproportionate effect on some summary statistics:
- They heavily pull the Mean towards them.
- They dramatically increase the Standard Deviation (SD).
- They have little to no effect on the Median or the Interquartile Range (IQR).
66
Bias / Systematic Error
Definition: A consistent, systematic error that causes a measurement or estimate to be consistently too high or consistently too low.
Intuition: It's a one-sided error that doesn't cancel out, even if you repeat the process many times. Think of a scale that is poorly calibrated and always reads five pounds heavy: that's bias.
Source in Measurement: Caused by a faulty instrument, a consistent environmental factor, or a flaw in the procedure (e.g., neglecting to calibrate the tool).
Source in Studies: Caused by flaws in the study design, such as selection bias (non-comparable groups) or response bias (subjects misreporting information).
Key Contrast: Unlike chance error, which is random and averages out to zero, bias must be addressed by calibration or a better study design.
67
Cartesian Coordinates
Definition: A system that uses a set of numbers (coordinates) to uniquely locate a point in space (typically a plane). It consists of two perpendicular number lines: the horizontal x-axis (abscissa) and the vertical y-axis (ordinate), which intersect at the origin (0,0).
Intuition: It's like giving someone driving directions on a grid. To find any point, you first say how far to go horizontally (the x-value) and then how far to go vertically (the y-value), written as (x, y).
Use in Statistics: Cartesian coordinates are the foundation for nearly all statistical graphs, including scatter diagrams (to show the relationship between two variables) and histograms (where frequency/density is the y-axis and the variable value is the x-axis).
Named After: The French philosopher and mathematician René Descartes.
68
Slope
Definition: A measure of the steepness and direction of a line. In a scatter diagram or regression line, it quantifies how much the dependent variable (y) changes for every one-unit change in the independent variable (x).
Intuition: The slope is the rate of change, the "rise over run." It tells you exactly how fast and in what direction the line is going. A steep slope means a small change in x leads to a large change in y.
Formula: slope = (change in y) / (change in x) = (y2 - y1) / (x2 - x1) = rise/run
Interpretation:
- Positive Slope (+): The line goes up from left to right, indicating a positive association (as x increases, y increases).
- Negative Slope (-): The line goes down from left to right, indicating a negative association (as x increases, y decreases).
- Slope of Zero (0): A horizontal line, indicating no association between x and y.
69
Intercept
Definition: The point where a line (such as a regression line) crosses the vertical axis (y-axis). It is the value of y when the independent variable, x, is equal to zero.
Intuition: The intercept gives you the starting value or the baseline amount of the dependent variable (y) before any influence from the independent variable (x) has occurred.
Formula (in the context of a line): It's the term 'a' in the equation of a line: y = a + bx.
Interpretation: The intercept is only meaningful in context if it is plausible for x to be zero and if the data supports extending the line back to that point. In many statistical contexts (like predicting adult height from infant weight), an x value of zero may be outside the range of the observed data, making the intercept value practically meaningless.
70
Q: What is the equation of a line?
y = mx + b, where:
- m = slope
- b = intercept
71
Scatter Diagram (or Scatter Plot)
Definition: A graphical tool used to display the relationship between two quantitative variables for a set of individuals.
Intuition: It lets you visually check for an association between two variables. Each subject in the study is represented by a single point on the graph, located using Cartesian coordinates (one variable on the x-axis, the other on the y-axis).
Purpose: To reveal the direction, form, and strength of the relationship:
- Positive Association: The points cluster around an upward-sloping line (as x increases, y increases).
- Negative Association: The points cluster around a downward-sloping line (as x increases, y decreases).
- No Association: The points look like a cloud with no clear direction.
Key Idea: The points show the raw data, allowing you to easily spot outliers or non-linear patterns.
72
Correlation Coefficient
r
How can we summarize the correlation between two variables (an independent and a dependent variable)? Convert each variable to standard units and then take the average of the products. It is reversible: r(x, y) = r(y, x).
Definition: A single number that measures the strength and direction of the linear association between two quantitative variables displayed in a scatter diagram. It measures clustering around a line, relative to the SDs.
Intuition: It tells you how tightly the points cluster around a straight line. An r value close to +1 or -1 means the points are tightly clustered, indicating a strong linear relationship. An r value close to 0 means the points are widely scattered, indicating a weak or non-existent linear relationship.
Range: The correlation coefficient always falls between -1 and +1 (i.e., -1 ≤ r ≤ 1).
Sign: The sign (+ or -) indicates the direction of the association (positive or negative slope).
Limitation: It only measures linear association. A strong non-linear relationship (like a U-shape) might have a correlation coefficient close to zero.
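A minimal sketch of the definition just given: convert each variable to standard units and average the products (toy data; SDs computed with n in the denominator, as in the text):

```python
import math

def correlation(xs, ys):
    """r = average of the products of x and y in standard units."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum(((x - mx) / sdx) * ((y - my) / sdy) for x, y in zip(xs, ys)) / n

xs, ys = [1, 2, 3, 4, 5], [2, 4, 5, 4, 5]
print(correlation(xs, ys))  # ~0.77
print(correlation(ys, xs))  # same value -- r is reversible
```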
73
Point of Averages
Definition: The single point on a scatter diagram whose coordinates are the average (mean) of the independent variable and the average (mean) of the dependent variable.
Coordinates: (average of x, average of y).
Intuition: It represents the typical or central subject in the entire dataset. It is the center of mass for all the points on the scatter diagram.
Key Property: The regression line (the line of best fit used for prediction) always passes through the point of averages. This ensures that the line is centered on the data and makes predictions relative to the mean of both variables.
74
Q: What are the 5 summary metrics for two variables?
1. average of x-values
2. SD of x-values
3. average of y-values
4. SD of y-values
5. correlation coefficient, r
75
SD Line
Definition: A line drawn on a scatter diagram that passes through the point of averages and has slope:
slope = (SD of y) / (SD of x)
Intuition: The SD Line helps visualize the spread of the data in Standard Units. It shows the slope the line would have if the correlation coefficient (r) were +1 or -1. In other words, it represents the tightest possible linear clustering for the given SDs.
Key Contrast with Regression Line: Unlike the regression line, the SD line is not used for prediction. The regression line is always closer to the horizontal axis than the SD line (except when r = ±1), illustrating the principle of regression to the mean.
Purpose: It serves as a visual reference to assess the strength of the correlation (r); the closer the scatter of points is to the SD line, the stronger the linear association.
76
Regression Method
Definition: A statistical technique used to predict the value of one variable (the dependent variable, y) based on the value of another variable (the independent variable, x).
Intuition: It finds the best straight line (the regression line or "line of best fit") through the scatter diagram. This line minimizes the overall prediction errors (specifically, the sum of the squared vertical distances from the points to the line). It acts as the most informed estimate for y given x.
Slope: b = r * (SD of y) / (SD of x)
Key Rule: For every one-SD increase in x, there is an increase of only r SDs in y, on average. Plotting these estimates gives the regression line for y on x.
Significance: Because the regression slope is a fraction of the SD line slope (unless r = ±1), it mathematically embodies the principle of regression to the mean: predictions for extreme x values are always less extreme in y (closer to the average of y) than they would be if the prediction followed the SD line.
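A sketch of the regression method from the five summary statistics alone (the averages, SDs, and r below are hypothetical, loosely in the spirit of the fathers-and-sons example):

```python
import math

# Hypothetical five-statistic summary.
avg_x, sd_x = 70.0, 3.0   # e.g., fathers' heights (inches)
avg_y, sd_y = 69.0, 3.0   # e.g., sons' heights (inches)
r = 0.5

slope = r * sd_y / sd_x            # 0.5
intercept = avg_y - slope * avg_x  # line passes through the point of averages

def predict(x):
    return slope * x + intercept

print(predict(76))  # 72.0 -- a father 2 SDs up predicts a son only 1 SD up

# Typical size of the prediction errors (RMS error for regression):
print(math.sqrt(1 - r ** 2) * sd_y)  # ~2.6 inches
```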
77
Graph of Averages
Definition: A graphical tool used to summarize a scatter diagram by plotting the average y-value for each distinct x-value (or for each class interval of x).
Intuition: Instead of showing hundreds of individual points, the graph of averages shows the trend more clearly. It collapses the vertical scatter for each x-value down to a single representative point: the mean of the y's for that x. It smooths out the noise to reveal the underlying relationship.
Key Property: The graph of averages is used to check for linearity. If the original scatter diagram shows a linear association, the graph of averages will also follow a straight line. If the original association is non-linear (curved), the graph of averages will show that same non-linear curve.
Relationship to Regression: The regression line is simply the straight line that best fits the points on the graph of averages.
78
The Regression Fallacy
Definition: The common mistake of attributing a real cause-and-effect relationship to observed changes that are actually due entirely to the statistical principle of regression to the mean.
Intuition: It's the error of thinking that a variable must have changed because of some action when, in reality, it was just an unusually extreme measurement moving back toward the average.
Example: A basketball player has an amazing, extreme scoring night. If the coach criticizes them afterward, and the player scores closer to their average the next game, the coach might mistakenly believe the criticism "worked," when the score was simply regressing to the mean.
Key Property: This fallacy occurs whenever one makes an observation after an extreme measurement. Because of natural random variation (chance error), the next measurement is very likely to be less extreme (closer to the long-run average), regardless of any external factor.
Significance: It highlights the need for a control group in experiments. Without a control group to measure natural regression, one might falsely conclude that a treatment or intervention caused a change.
79
Regression Effect
Definition: The statistical tendency for subjects or measurements that are extreme on a first measurement to be closer to the average on a second, subsequent measurement.
Intuition: This is a natural consequence of chance error. When you observe an extreme value (either very high or very low), part of that extremeness is likely due to good or bad luck (random chance). The next time you measure, that luck component is likely to be closer to zero, causing the overall value to "regress" or move back toward the long-run average.
Mechanism: The effect is only observed when the two variables being measured are not perfectly correlated (r is not ±1). The weaker the correlation (the closer r is to 0), the stronger the regression effect.
Significance: The Regression Effect is the statistical truth that underlies the Regression Fallacy (the error of misinterpreting this natural movement toward the average as a caused effect).
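A small simulation of the mechanism described above (all numbers invented): observed score = true skill + chance error, measured twice. The group that was extreme on test 1 averages noticeably closer to 100 on test 2, with no intervention at all.

```python
import random

random.seed(0)
true_skill = [random.gauss(100, 10) for _ in range(10_000)]
test1 = [t + random.gauss(0, 10) for t in true_skill]  # skill + luck
test2 = [t + random.gauss(0, 10) for t in true_skill]  # same skill, fresh luck

# People who looked extreme (> 120) on the first test...
top = [(a, b) for a, b in zip(test1, test2) if a > 120]
print(sum(a for a, _ in top) / len(top))  # ~126 on test 1
print(sum(b for _, b in top) / len(top))  # ~113 on test 2: regressed toward 100
```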
80
Q: The regression method can be used to predict y from x. However, actual values differ from predictions (residuals) - but by how much? How do we measure the overall size of the differences?
RMS Error for Regression. It measures the typical size of the residuals (prediction errors):
RMS error = sqrt(1 - r^2) * (SD of y)
81
Error
Definition: In statistics, error refers to the difference between an observed value and a predicted, theoretical, or true value. It represents the inherent variability or lack of perfect precision in data.
Intuition: Error is the amount by which you miss the mark. It's the inevitable difference between what you measure or predict and the actual fact.
82
Residual
Definition: The vertical distance between an observed data point and the corresponding point on the regression line (the predicted value). It represents the error in prediction for a specific observation.
83
Selection Bias
Definition: A systematic flaw in a study's design where the individuals or groups being compared are fundamentally different in a way that relates to the outcome being studied, before the treatment or exposure is even applied.
Intuition: It means the groups weren't comparable to start with. The study suffers from a "non-comparable groups" problem. Any observed difference in the outcome could be due to these pre-existing differences rather than the treatment itself.
Cause: It typically occurs in Observational Studies when subjects self-select their group (e.g., smokers vs. non-smokers), or when the researcher chooses non-random, non-representative samples.
Key Solution: The best way to eliminate selection bias is through Random Assignment in a Randomized Controlled Trial (RCT). Randomization ensures that, on average, the treatment and control groups are comparable in all aspects, both known and unknown.
Example: Comparing the health of people who choose to run marathons to those who don't. The marathon runners were likely healthier to begin with, leading to selection bias.
84
Residual Plot
Definition: A scatter diagram that plots the residuals (the prediction errors) on the vertical (y) axis against the independent variable (x) or the predicted values on the horizontal (x) axis.
Intuition: It's a way to magnify the errors of the regression line. If the regression line is a good fit, the residual plot should show no pattern: just a random, symmetric cloud of points centered around the horizontal line at y = 0.
Purpose (Diagnostic Tool): To check the assumptions of the linear regression model:
- Linearity: If the plot shows a curved pattern (e.g., a "U" or inverted "U"), the true relationship is non-linear, and the linear model is inappropriate.
- Equal Scatter (Homoscedasticity): If the plot shows a funnel shape (scatter widens or narrows across x), the assumption of constant RMS error is violated.
Conclusion: A well-fitting model has a residual plot that looks like a horizontal band of random noise centered at zero.
85
Homoscedastic
Definition: A condition in a statistical model (like linear regression) where the variability (or spread) of the errors (residuals) is constant across all levels of the independent variable (x).
Intuition: It means the vertical scatter of the data points around the regression line is the same width everywhere along the line. The data points form a consistent, uniform band.
Key Importance: This is an important assumption for many statistical methods. When data is homoscedastic, the RMS Error for Regression is an accurate measure of prediction error across the entire range of the data.
Contrast: The opposite is Heteroscedasticity (Heteroskedasticity), where the scatter changes (often forming a "funnel" shape). If data is heteroscedastic, predictions in areas of low scatter are actually more accurate than the RMS error suggests, and predictions in areas of high scatter are less accurate than the RMS error suggests.
86
Heteroscedastic
Definition: A condition in a statistical model (like linear regression) where the variability (or spread) of the errors (residuals) is not constant across all levels of the independent variable (x).
Intuition: It means the vertical scatter of the data points around the regression line is wider in some places and narrower in others. The data points form a shape like a funnel or a wedge.
Key Importance: When data is heteroscedastic, the RMS Error for Regression is misleading because it represents a single "average" spread. In areas of high scatter, the prediction error is actually larger than the RMS suggests, and in areas of low scatter, the error is smaller. This invalidates certain statistical inferences.
Contrast: The opposite is Homoscedasticity, where the vertical scatter is uniform across the entire range of x.
87
Q: The equation of the regression line for y on x is?
y = slope * x + intercept = mx+b
88
Q: the equation for the simple linear regression slope is?
m = r * (SD of y) / (SD of x)
This gives the average change in y associated with a one-unit change in x (an association, not proof of causation).
89
Method of least squares
Definition: The mathematical procedure used to find the equation of the best-fitting line (the regression line) for a set of data points in a scatter diagram.
Intuition: The method determines the line that minimizes the total amount of error. It achieves this by finding the line that makes the sum of the squares of the vertical distances (the residuals) from all the data points to the line as small as possible.
Why "Square" the Errors? Squaring the residuals accomplishes two things:
1. It ensures that all errors (both positive and negative) contribute equally, preventing positive and negative residuals from cancelling each other out.
2. It penalizes large errors much more heavily than small errors, ensuring the line is a good fit across all the data, not just the center.
Outcome: The result is the unique line that provides the most reliable linear prediction of the dependent variable (y) from the independent variable (x).
90
Frequency Theory (of Probability)
Works best for processes which can be repeated over and over again, independently and under the same conditions.
Definition: The view that the probability of an event is defined by the long-run relative frequency with which the event occurs in a very large number of independent and identical trials.
Intuition: Probability isn't just a guess; it's what actually happens over the long haul. If you flip a fair coin many, many times, the proportion of heads you get will stabilize and get closer and closer to 0.5. This stable proportion is the probability.
Key Idea: It defines probability based on observability and empirical evidence. For an event to have a probability, it must be repeatable. The probability is the limit of the relative frequency as the number of trials approaches infinity.
Contrast: It differs from the Classical (or Equally Likely Outcomes) Theory (where probability is defined by counting possibilities) and the Subjective Theory (where probability reflects a personal degree of belief).
91
Probability / Chance
Definition: A number that quantifies the likelihood of a specific event occurring. It is expressed as a number between 0 and 1 (or as a percentage between 0% and 100%).
Intuition: It's a measure of how often something is expected to happen.
- A probability of 0 (or 0%) means the event is impossible.
- A probability of 1 (or 100%) means the event is certain.
- A probability of 0.5 (or 50%) means the event is as likely to happen as not.
Theoretical Calculation (Classical): If all possible outcomes are equally likely (e.g., rolling a die), the probability is calculated as:
probability = (number of favorable outcomes) / (number of total possible outcomes)
Empirical Calculation (Frequency Theory): The long-run relative frequency of the event occurring in many repeated trials.
Key Idea: Chance introduces randomness into the world, and probability provides the mathematical framework for understanding and predicting the long-term patterns within that randomness.
92
Conditional Probability
Definition: The probability of an event occurring given that another event has already occurred.
Notation: It is written as P(B|A), which is read as "the probability of event B given event A."
Intuition: It narrows the focus from the entire sample space to a subset of outcomes defined by the condition. It asks, "Out of all the times event A happened, how often did event B happen too?"
Formula: P(B|A) = P(A AND B) / P(A) (the probability of both events happening divided by the probability of the conditioning event A happening).
Key Idea (Dependence): If P(B|A) is different from P(B), then the two events are dependent; the occurrence of A changes the likelihood of B. If P(B|A) is equal to P(B), the events are independent.
93
Laws of Probability: Multiplication Rule(s)
Definition: A fundamental probability rule used to calculate the probability that two or more events will all occur (the probability of their intersection). The AND rule.
Intuition: It's how you figure out the chances of getting "this" and "that" to happen. The core idea is that the chance of the second event happening depends on whether the first event actually occurred.
1. Dependent events:
P(A AND B) = P(A) * P(B|A)
Intuition: To get both outcomes, you first need to succeed at A, and then you need to succeed at B out of the restricted group where A has already happened.
2. Independent events:
P(B|A) = P(B), so P(A AND B) = P(A) * P(B)
Intuition: Since the events don't influence each other, you can simply multiply their individual probabilities together to find the chance of both happening.
3. Mutually exclusive events:
P(A AND B) = 0
Intuition: A and B cannot happen together. One prevents the other.
94
Laws of Probability: Addition Rule(s)
Definition: A fundamental probability rule used to calculate the probability that at least one of two or more events occurs. The OR rule.
Intuition: It's how you figure out the chances of getting either "this" or "that" to happen. If you simply add the individual probabilities, you might double-count the overlap, so the rule includes a way to correct for that.
General Rule:
P(A OR B) = P(A) + P(B) - P(A AND B)
Intuition: You add the probabilities of A and B, and then subtract the probability of their overlap (A and B both happening) because you counted that region twice (once in P(A) and once in P(B)).
Mutually Exclusive: P(A AND B) = 0, so
P(A OR B) = P(A) + P(B)
Intuition: Since there is no overlap to double-count, you simply add the individual probabilities.
95
Laws of Probability: Complement of an Event
Definition: A probability rule that states the probability of an event not occurring is equal to one (or 100%) minus the probability that the event does occur.

P(A') = 1 - P(A), and equivalently P(A) = 1 - P(A')

Intuition: Since an event must either happen or not happen, the probabilities of an event and its complement must sum to 1. This rule is especially useful for calculating the probability of complex events by finding the probability of the simpler opposite event and subtracting it from 1.

Example: The chance of drawing at least one ace from a deck is often easier to find by calculating the chance of drawing no aces, and then subtracting that result from 1.
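The card's own example, sketched in code: the chance of at least one ace in a five-card hand (the hand size is my assumption), found via the complement:

```python
from math import comb
from fractions import Fraction

# P(no aces in a 5-card hand) = C(48,5) / C(52,5): all 5 cards come from the 48 non-aces
p_no_aces = Fraction(comb(48, 5), comb(52, 5))

# Complement rule: P(at least one ace) = 1 - P(no aces)
p_at_least_one_ace = 1 - p_no_aces
print(float(p_at_least_one_ace))  # about 0.34
```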
96
Laws of Probability: Mutually Exclusive
No overlap in the Venn diagram.

Definition: Two or more events are mutually exclusive (or disjoint) if the occurrence of one event precludes (makes impossible) the occurrence of the other. They cannot both happen at the same time.

Intuition: The events have no overlap. If you're counting possibilities, the outcomes belonging to one event are completely separate from the outcomes belonging to the other.

Example: When rolling a single six-sided die, the event "rolling a 2" and the event "rolling an odd number" are mutually exclusive. You cannot do both on the same roll.

Key Rule (Addition Rule for Mutually Exclusive Events): The probability that at least one of two mutually exclusive events (A or B) occurs is simply the sum of their individual probabilities: P(A OR B) = P(A) + P(B)
97
Laws of Probability: Conditional Probability
P(A|B) = P(A AND B) / P(B)

Graphically (Venn diagram): conditioning on B shrinks the sample space down to just the B circle. P(A|B) is the fraction of B's area covered by the overlap region A AND B; dividing by P(B) rescales that overlap so the probabilities inside the new, smaller sample space add up to 1.
98
Laws of Probability: Bayes' Theorem
Derivation: start from the definition of conditional probability, P(A|B) = P(A AND B) / P(B), and substitute the multiplication rule, P(A AND B) = P(A) × P(B|A), to get:

P(A|B) = P(B|A) × P(A) / P(B)

Posterior Probability = P(A|B)
Prior Probability = P(A)
Likelihood = P(B|A)
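A minimal sketch with made-up numbers (the 1% prevalence, 90% sensitivity, and 5% false-positive rate are all hypothetical assumptions, not from the book), showing the posterior computed from the prior and the likelihood:

```python
# Hypothetical diagnostic test: A = "has condition", B = "tests positive"
p_A = 0.01             # prior P(A): prevalence
p_B_given_A = 0.90     # likelihood P(B|A): sensitivity
p_B_given_notA = 0.05  # false-positive rate P(B|not A)

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: posterior P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_B_given_A * p_A / p_B
print(round(posterior, 3))  # about 0.154: most positives are false positives
```

The low posterior despite a fairly accurate test is the standard Bayesian point: when the prior is small, the likelihood alone can be misleading.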
99
Laws of Probability: Independence
Definition: Two events, A and B, are independent if the occurrence or non-occurrence of one event does not change the probability of the other event occurring.

Intuition: The events have no influence on each other. If knowing whether A happened gives you no new information about the likelihood of B happening, they are independent.

Mathematical Condition: The conditional probability of B given A is the same as the unconditional probability of B: P(B|A) = P(B). Equivalently, P(A AND B) = P(A) × P(B).

Contrast: The opposite is dependence, where knowing the outcome of one event does change the probability of the other.
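A sketch checking the condition P(B|A) = P(B) by enumeration for two dice rolls (the events are my own illustrative choices):

```python
from itertools import product
from fractions import Fraction

rolls = list(product(range(1, 7), repeat=2))  # all 36 equally likely outcomes

A = [r for r in rolls if r[0] % 2 == 0]       # A: first die is even
B = [r for r in rolls if r[1] > 4]            # B: second die shows 5 or 6
A_and_B = [r for r in rolls if r in A and r in B]

p_B = Fraction(len(B), len(rolls))            # unconditional P(B) = 1/3
p_B_given_A = Fraction(len(A_and_B), len(A))  # conditional P(B|A)
print(p_B, p_B_given_A, p_B == p_B_given_A)   # 1/3 1/3 True
```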
100
Binomial Probability Mass Function
Definition: A formula used to calculate the exact probability of getting a specific number of "successes" (k) in a fixed number of independent trials (n), where there are only two possible outcomes for each trial (success or failure).

Intuition: It answers the question: "If I repeat something n times, what's the chance that exactly k of those times turn out to be a success?"

Formula: P(X = k) = C(n, k) × p^k × (1 - p)^(n - k), where p is the probability of success on a single trial and C(n, k) is the binomial coefficient ("n choose k").

Conditions for Using the Binomial Formula (BINS):
1. Binary: Each trial has only two possible outcomes (success or failure)
2. Independent: The outcome of one trial does not affect the outcome of any other trial
3. Number (Fixed): The number of trials, n, is fixed in advance
4. Same Probability: The probability of success, p, is the same for every trial

It is an application of the multiplication rule (the p^k × (1 - p)^(n - k) term is the chance of one particular sequence with k successes) combined with the addition rule (the C(n, k) term adds up all the mutually exclusive sequences containing exactly k successes).
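A minimal sketch of the formula (standard library only), checked against the chance of exactly 2 heads in 4 fair-coin flips (my illustrative case):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly 2 heads in 4 fair flips: C(4,2) * 0.5^2 * 0.5^2 = 6/16
print(binomial_pmf(2, 4, 0.5))                          # 0.375
print(sum(binomial_pmf(k, 4, 0.5) for k in range(5)))   # the pmf sums to 1.0
```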
101
Combination
Definition: A selection of items from a larger set where the order of selection does not matter.

Intuition: It's about forming a group or a committee. For example, choosing a group of three people (A, B, and C) for a committee is the same combination whether you select them in the order A-B-C or C-B-A.

Formula: The number of combinations of selecting k items from a set of n items (read as "n choose k") is C(n, k) = n! / (k!(n - k)!).

Contrast with Permutation: In a permutation, the order does matter. A combination is never larger than the corresponding permutation because it eliminates all the repeated arrangements.
102
Permutation
Definition: An arrangement of items in a specific order. It is a selection of a certain number of items from a set where the sequence of selection matters.

Intuition: It's about forming a lineup or a ranking. For example, selecting three people (A, B, C) and assigning them roles (President, Vice-President, Secretary) is a permutation, because the order A-B-C is a different outcome than the order C-B-A.

Formula: The number of permutations of k items selected from a set of n items is P(n, k) = n! / (n - k)!.

Contrast with Combination: In a combination, the order does not matter. Because order matters for a permutation, the number of permutations will always be greater than or equal to the number of combinations for the same n and k.
103
Q: What is the difference between a combination and a permutation?
The essential difference between a combination and a permutation lies in whether the order of selection matters.

A combination is a selection of items where order does not matter (it's a group or a committee). Key Concept: Forming a group. Intuition: Choosing a hand of cards. The hand (King, Queen, Jack) is the same regardless of the order the cards were drawn.

A permutation is an arrangement of items where the order of selection does matter (it's an ordered list or a sequence). Key Concept: Forming an ordered arrangement. Intuition: Choosing a lineup or a ranking. Placing the King at position 1, Queen at position 2, and Jack at position 3 is different from placing the Queen at position 1, King at position 2, and Jack at position 3.
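A sketch relating the two counts with the standard library's math.comb and math.perm (choosing 3 items out of 5 is my illustrative case):

```python
from math import comb, perm, factorial

n, k = 5, 3
print(perm(n, k))  # 60 ordered arrangements: n! / (n-k)!
print(comb(n, k))  # 10 unordered groups:    n! / (k!(n-k)!)

# Each group of k items can be ordered in k! ways, so perm = comb * k!
assert perm(n, k) == comb(n, k) * factorial(k)
```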
104
Binomial Coefficient
This is the combination formula, written C(n, k) and read "n choose k."

Definition: The number that represents the number of ways to choose (or combine) k items from a set of n distinct items when the order of selection does not matter.

Intuition: It calculates how many unique groups (or combinations) of a certain size can be pulled out of a larger total group. It's the "choose" part of the process.
105
The Law of Averages
Definition: The Law of Averages is a popular but incorrect (fallacious) interpretation of the Law of Large Numbers. It suggests that if an event has occurred less frequently than its expected probability over a short period, it is "due" to occur more frequently in the near future to "balance things out."

Intuition (The Fallacy): People mistakenly believe that chance processes have a memory or a self-correcting mechanism. For example, believing that after five coin flips result in heads, the next flip is more likely to be tails. This is false.

The Statistical Truth (The Law of Large Numbers): The true law states that as the number of trials increases, the proportion of times an event occurs will get closer to its theoretical probability.

Crucial Insight: The law only governs the long run. Individual trials are independent. Past results do not influence future independent trials. A fair coin has a 50% chance of heads on every single flip, regardless of prior results.

Key Consequence: The absolute difference between the expected number of occurrences and the actual number of occurrences usually increases as the number of trials grows, even as the proportional difference shrinks.
106
Box Model
The Box Model is a conceptual tool used in probability and statistics to represent the structure of a chance process, such as drawing tickets, sampling, or repeated independent trials (like coin flips or dice rolls).

Purpose: The box model simplifies a complex real-world problem into a manageable statistical framework, allowing you to calculate the expected value and the standard error of the sum, average, or count of the draws.

To set one up, answer three questions:
1. What numbers go into the box?
2. How many of each kind?
3. How many draws total?
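A sketch of a box model for rolling a fair die, using the book's formulas for draws made with replacement: expected value of the sum = (number of draws) × (average of the box), and SE of the sum = √(number of draws) × (SD of the box). The 100-draw count is my illustrative choice:

```python
from math import sqrt

box = [1, 2, 3, 4, 5, 6]  # one ticket per face of a fair die
n_draws = 100             # draws made at random with replacement

box_avg = sum(box) / len(box)                                    # 3.5
box_sd = sqrt(sum((t - box_avg) ** 2 for t in box) / len(box))   # about 1.71

ev_sum = n_draws * box_avg       # expected value of the sum: 350
se_sum = sqrt(n_draws) * box_sd  # standard error of the sum: about 17.1
print(ev_sum, round(se_sum, 1))
```

A typical sum of 100 rolls should come out around 350, give or take 17 or so; that "give or take" number is the standard error.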
107
Law of Large Numbers (LLN)
Definition: A fundamental theorem in probability that states that as the number of independent trials or observations increases, the average (or proportion) of the observed outcomes will tend to get closer and closer to the expected value (or theoretical probability).

Intuition: The law formalizes the idea of the long run. It means that while the outcome of any single random event is unpredictable, the results of many repeated trials follow a predictable, stable pattern dictated by probability. The randomness of individual events tends to cancel out over time.

Key Insight: The LLN only concerns the proportion or average of events. It does not mean that the absolute number of heads and tails in a coin flip experiment will get closer together. In fact, the absolute difference usually grows, but it becomes insignificant when viewed as a proportion of the total trials.

Contrast with Law of Averages: The LLN is the correct statistical principle, while the Law of Averages is the fallacy that suggests a streak must be corrected in the short term. The LLN proves that for independent trials, past results do not influence future results.
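A closing sketch of the Key Insight (the seed and checkpoint sizes are my choices): in a typical simulated run, the count of heads drifts further from n/2 in absolute terms even as the proportion of heads closes in on 0.5:

```python
import random

random.seed(4)  # fixed seed so the run is reproducible

heads = 0
for n in range(1, 1_000_001):
    heads += random.random() < 0.5  # one fair-coin flip
    if n in (100, 10_000, 1_000_000):
        abs_diff = abs(heads - n / 2)     # chance error in absolute terms
        prop_diff = abs(heads / n - 0.5)  # the same error as a proportion
        print(f"n={n:>9}  |heads - n/2| = {abs_diff:>8.0f}  |prop - 0.5| = {prop_diff:.5f}")
```

The absolute column tends to grow with n while the proportion column shrinks toward 0, which is exactly the distinction between the Law of Large Numbers and the Law of Averages fallacy.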