Statistics Flashcards

(52 cards)

1
Q

What is variance and how does it affect model performance?

A
  • Variance measures how much a set of values deviates from their mean.
  • In ML, model variance refers to how much a model’s predictions change with different training data.
  • High variance: Model fits training data very closely → overfitting, poor generalization on unseen data.
  • Low variance: Model predictions are stable across datasets but may underfit if too simple.

Key point: Variance is central to the bias-variance tradeoff:

> High variance → overfitting; High bias → underfitting. Random Forest reduces variance via averaging multiple trees.

2
Q

What is standard deviation, and how is it related to variance? How does it help in ML?

A
  • Standard deviation (SD) is the square root of variance, giving a measure of spread in the same units as the data.
    SD = sqrt( Σ(x_i - mean(x))² / n )
  • Relationship: Variance quantifies spread in squared units; SD makes it interpretable in original units.
  • Use in ML: SD helps detect outliers, normalize features, and understand feature dispersion.

Key point: SD and variance are fundamental for scaling, normalization, and understanding model input distribution.
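The relationship is easy to check numerically — a minimal NumPy sketch (the data values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # toy data

variance = np.mean((x - x.mean()) ** 2)   # spread in squared units
sd = np.sqrt(variance)                    # same units as the data

print(variance, sd)   # 4.0 and 2.0
```

Note that `sd` matches `np.std(x)`, which computes the same square root of the mean squared deviation.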

3
Q

What is covariance, and how is it different from correlation?

A

Covariance: Measures how two variables vary together.

Cov(X, Y) = (1/n) * Σ[(X_i - mean(X)) * (Y_i - mean(Y))]
  • Positive → X and Y increase together
  • Negative → One increases while the other decreases
  • Zero → No linear relationship

Correlation: Normalized covariance, ranges [-1, 1]

r = Cov(X, Y) / (SD(X) * SD(Y))
  • Unitless → easier to compare relationships
  • Covariance depends on units; correlation does not

Key point: Covariance shows joint variability; correlation shows strength and direction of linear relationship.
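A small NumPy sketch contrasting the two quantities (toy data assumed):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # toy data: Y = 2X exactly
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

cov = np.mean((X - X.mean()) * (Y - Y.mean()))   # population covariance
r = cov / (X.std() * Y.std())                    # unitless, in [-1, 1]

print(cov, r)   # 4.0 and ≈1.0 — a perfect positive linear relationship
```

Rescaling Y (say, to different units) would change `cov` but leave `r` at 1, which is why correlation is easier to compare across feature pairs.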

4
Q

What is a probability distribution, and why is it important in ML?

A
  • Probability distribution: Describes how values of a random variable are spread and their likelihoods.
  • Types commonly used in ML:
    • Discrete: e.g., Bernoulli, Binomial
    • Continuous: e.g., Normal (Gaussian), Uniform, Exponential
  • Importance in ML:
    • Helps model data, make predictions, and estimate uncertainty.
    • Foundation for statistical inference, Bayesian methods, and loss function design (e.g., MSE assumes Gaussian errors).

Key point: Understanding the data distribution is essential for choosing models, preprocessing, and evaluating uncertainty.

5
Q

What is the difference between a population and a sample?

A
  • Population: The entire set of items or observations of interest.
  • Sample: A subset of the population used to make inferences.

Why it matters in ML:

  • Models are trained on samples, but we aim to generalize to the population.
  • Sampling introduces variance and bias, which must be considered in model evaluation.

Key point: Understanding the difference helps in estimating errors, confidence intervals, and generalization.

6
Q

Define bias and variance in machine learning.

A
  • Bias: Error due to wrong assumptions in the model (underfitting).
    • High bias → model too simple, poor training & test performance.
  • Variance: Error due to sensitivity to training data (overfitting).
    • High variance → model performs well on training but poorly on test data.

Bias-Variance Tradeoff:

  • Total error = Bias² + Variance + Irreducible error
  • Goal: balance bias and variance for best generalization.

Key point: Random Forest reduces variance via averaging; more flexible models (e.g., deep networks) reduce bias.

7
Q

What is skewness and kurtosis in data?

A
  • Skewness: Measures asymmetry of a data distribution.
    • Positive skew → tail on right
    • Negative skew → tail on left
    • Zero skew → symmetric distribution
  • Kurtosis: Measures “tailedness” or how heavy/extreme the tails are.
    • High kurtosis → heavy tails, more outliers
    • Low kurtosis → light tails, fewer extreme values
    • Normal distribution kurtosis ≈ 3

Importance in ML:

  • Skewed features may need transformation for models assuming normality.
  • High kurtosis can affect robustness and loss functions.

Key point: Skewness and kurtosis help understand distribution shape, outliers, and preprocessing needs.

8
Q

What is a z-score, and why is it useful?

A
  • Definition: Measures how many standard deviations a data point is from the mean.
z = (x - mean) / SD
  • Interpretation:
    • z = 0 → data point equals the mean
    • z > 0 → above the mean
    • z < 0 → below the mean
  • Use in ML:
    • Standardization of features
    • Detecting outliers
    • Useful in distance-based algorithms (k-NN, clustering)

Key point: Z-scores normalize data for better model performance and comparability.
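A quick NumPy sketch (toy feature values assumed) showing that z-scoring yields mean 0 and SD 1:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # toy feature

z = (x - x.mean()) / x.std()   # how many SDs each point is from the mean

print(z)                   # the middle point (= the mean) maps to z = 0
print(z.mean(), z.std())   # ≈ 0 and 1 after standardization
```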

9
Q

Explain the Central Limit Theorem (CLT) in a simple way.

A
  • Idea: If you take many random samples from any population and calculate the average of each sample, those averages will form a normal distribution as the number of samples grows.
  • Simple Example:
    1. Take a die (numbers 1–6). A single roll is not normally distributed; it’s uniform.
    2. Roll it 5 times, calculate the average. Repeat this many times.
    3. Plot all averages → the shape will be bell-shaped (normal).
  • Why it matters in ML:
    • Helps us assume normality for sample means, even if data isn’t normal.
    • Supports confidence intervals, hypothesis tests, and robust model evaluation.

Key point: CLT explains why averages of random samples tend to be predictable and normally distributed, which is very useful for ML statistics.
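The dice experiment above can be simulated directly — a small NumPy sketch (the seed and sample counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll a fair die 5 times, average the rolls, and repeat 10,000 times
rolls = rng.integers(1, 7, size=(10_000, 5))
sample_means = rolls.mean(axis=1)

# Even though a single roll is uniform (not normal), the averages cluster
# around the true mean 3.5 in a bell-shaped histogram
print(sample_means.mean(), sample_means.std())
```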

10
Q

What is the Central Limit Theorem (CLT) and why is it important in ML?

A
  • Central Limit Theorem: The sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the original population distribution.
  • Importance in ML:
    1. Justifies assuming normality for large sample sizes.
    2. Supports statistical inference, confidence intervals, and hypothesis testing.
    3. Helps in designing robust estimators and understanding sampling variability.

Key point: CLT enables ML practitioners to apply probabilistic reasoning and statistical tests even on non-normal data, as long as sample size is large.

11
Q

What is an outlier, and why is it important in ML?

A
  • Outlier: A data point that is significantly different from the majority of the data.
  • Causes: Measurement error, rare events, or natural variability.
  • Impact on ML:
    • Can distort mean, variance, and model parameters.
    • Can affect distance-based algorithms (k-NN, clustering).
    • Can lead to overfitting in models sensitive to extreme values.
  • Detection Methods:
    • Z-score (e.g., |z| > 3)
    • IQR method (1.5 × IQR beyond Q1/Q3)
    • Visualizations: boxplots, scatterplots
  • Handling: Remove, cap, or transform outliers depending on context.

Key point: Identifying and handling outliers improves model accuracy and robustness.
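Both detection rules can be sketched in a few lines of NumPy (toy data with one planted outlier; the z cutoff is lowered to 2.5 here because a single outlier in a sample of only 10 points cannot push |z| much above 3):

```python
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102.0])   # 102 planted as an outlier

# Z-score rule (the usual cutoff is |z| > 3; 2.5 is used for this tiny sample)
z = (x - x.mean()) / x.std()
outliers_z = x[np.abs(z) > 2.5]

# IQR rule: flag points beyond 1.5 × IQR from Q1/Q3
q1, q3 = np.percentile(x, [25, 75])
fence_low, fence_high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers_iqr = x[(x < fence_low) | (x > fence_high)]

print(outliers_z, outliers_iqr)   # both rules flag only 102.0
```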

12
Q

Why does applying log or other transformations not change the relationship between features and target?

A
  • Core idea: Transformations rescale or compress data but preserve the order of values.
    • E.g., if house A > house B in size, after log: log(A) > log(B) → ranking stays the same.
  • Effect on relationships:
    • Linear relationships in transformed space may approximate non-linear relationships in original space.
    • Helps linear models fit patterns better, but the underlying correlation is preserved.
  • Example:
    • Original sizes: 50, 100, 200, 1000 → target prices
    • Log sizes: log(50), log(100), log(200), log(1000) → preserves order
    • Model now sees more evenly spaced data, less dominated by outliers

Key point: Transformations improve model fit and reduce skewness without reversing or breaking relationships between variables.
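The order-preservation claim is easy to verify — a minimal NumPy sketch using the sizes from the example:

```python
import numpy as np

sizes = np.array([50.0, 100.0, 200.0, 1000.0])   # sizes from the example above
log_sizes = np.log(sizes)

# Ranking is unchanged: log is strictly increasing
same_order = np.argsort(sizes).tolist() == np.argsort(log_sizes).tolist()

# But spacing becomes far more even, so 1000 no longer dominates
print(same_order)             # True
print(np.diff(sizes))         # [ 50. 100. 800.]
print(np.diff(log_sizes))     # ≈ [0.69 0.69 1.61]
```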

13
Q

What are residuals in regression, and why are they important?

A
  • Definition: Residual = difference between the actual value and the predicted value by the model.
residual = actual_value - predicted_value
  • Purpose:
    1. Measures model error for each observation
    2. Helps diagnose model fit — patterns in residuals indicate problems
    3. Assumptions in linear regression:
      • Residuals should be normally distributed
      • Mean = 0
      • Homoscedasticity (constant variance)

Key point: Residual analysis is essential for evaluating model performance and assumptions, especially in linear models.
14
Q

What are the assumptions of linear regression?

A
  1. Linearity: Relationship between features and target is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: Residuals have constant variance across all predicted values.
  4. Normality of residuals: Residuals are approximately normally distributed.
  5. No multicollinearity: Features are not highly correlated with each other.
  6. No autocorrelation: Residuals should not be correlated (especially in time series).

Key point: Violating these assumptions can lead to biased, inefficient, or invalid estimates, so diagnostic checks and preprocessing are essential.

15
Q

Why is multicollinearity a problem in linear regression? Give an example.

A
  • Problem: When two or more features are highly correlated, the model cannot distinguish their individual effects on the target.
  • Effects:
    1. Unstable coefficients → small changes in data cause large swings in estimates
    2. High standard errors → low statistical significance
    3. Hard to interpret which feature truly impacts the target
  • Example:
    • Predicting house price using:
      • Feature 1: house_size (m²)
      • Feature 2: number_of_rooms
    • These are highly correlated → model cannot tell which feature contributes more to price.
    • Coefficients may flip signs or vary greatly if dataset changes slightly.

Key point: Multicollinearity does not reduce predictive power much for tree-based models, but for linear regression it destabilizes coefficients and interpretation.

16
Q

You notice two features in your dataset are highly correlated. Why is this a problem for linear regression, and how can you fix it?

A
  • Problem: Multicollinearity
    • Makes coefficient estimates unstable and hard to interpret
    • Increases standard errors, reducing statistical significance
  • Solutions:
    1. Remove one of the correlated features
    2. Combine features (e.g., via PCA)
    3. Regularization (Ridge or Lasso regression) to reduce impact of collinearity

Key point: Handling multicollinearity improves model stability, interpretability, and prediction accuracy.
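The Ridge fix can be illustrated with a NumPy-only sketch using the closed-form solutions (the synthetic data, noise levels, and α are arbitrary choices): with two nearly duplicate features, plain least squares produces unstable coefficients, while the ridge penalty shrinks and stabilizes them.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

size = rng.normal(100, 20, n)                 # house size in m² (synthetic)
rooms = size / 25 + rng.normal(0, 0.1, n)     # nearly a copy of size → collinear
y = 3.0 * size + rng.normal(0, 5, n)          # target truly depends on size only

X = np.column_stack([size, rooms])

# Ordinary least squares: w = (XᵀX)⁻¹Xᵀy — ill-conditioned under collinearity
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (XᵀX + αI)⁻¹Xᵀy — the penalty stabilizes the solution
alpha = 10.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print(w_ols, w_ridge)   # ridge coefficients have a smaller norm, same good fit
```

Ridge never increases the coefficient norm relative to OLS, and here predictions stay essentially as accurate because the shrinkage only suppresses the near-redundant direction.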

17
Q

What is feature scaling, and why is it important in ML?

A
  • Feature scaling: Process of normalizing or standardizing feature values to a common scale.
  • Common methods:
    1. Min-Max Scaling: Scales features to [0, 1]
      X_scaled = (X - X_min) / (X_max - X_min)
    2. Standardization (Z-score): Mean 0, SD 1
      X_scaled = (X - mean(X)) / SD(X)
  • Importance:
    • Algorithms sensitive to feature scale: gradient descent, k-NN, SVM, neural networks
    • Prevents dominance of large-scale features over small-scale features
    • Improves convergence speed and stability

Key point: Scaling ensures features contribute equally to model learning and improves training efficiency.
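Both methods in a short NumPy sketch (toy height values assumed):

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0, 200.0])   # e.g. heights in cm (toy data)

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max → [0, 1]
x_std = (x - x.mean()) / x.std()                 # z-score → mean 0, SD 1

print(x_minmax)                    # [0.  0.2 0.4 0.6 1. ]
print(x_std.mean(), x_std.std())   # ≈ 0 and 1
```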

18
Q

Why are some algorithms (gradient descent, k-NN, SVM, neural networks) sensitive to feature scale?

A
  1. Gradient Descent (used in linear/logistic regression, neural networks):
    • Gradient updates depend on feature values.
    • Large-scale features produce large gradients, small-scale features produce small gradients → optimization becomes unbalanced, slows convergence, may oscillate.
    • Scaling ensures equal contribution and faster, stable convergence.
  2. k-Nearest Neighbors (k-NN):
    • Distance-based algorithm (Euclidean, Manhattan).
    • Features with larger scales dominate distance computation, small-scale features are ignored.
    • Scaling ensures all features equally affect distance.
  3. Support Vector Machines (SVM):
    • SVM finds the maximum margin hyperplane using distances.
    • Features with larger scales skew distances → hyperplane biased toward large-scale features.
    • Scaling ensures fair margin calculation.
  4. Neural Networks:
    • Inputs with large variance can cause activation outputs to saturate (sigmoid/tanh) or slow learning (gradient issues).
    • Standardized inputs improve learning speed and stability.

Key point: Feature scaling prevents domination by large-scale features, ensures balanced learning, and improves model convergence.

19
Q

What’s the difference between normalization and standardization, and when should we use each?

A
  1. Normalization (Min–Max Scaling)
    • Equation:
      x' = (x - x_min) / (x_max - x_min)
    • Scales values to a fixed range [0, 1].
    • Preserves the shape of the distribution but changes its scale.
    • Use when:
      • Features have different units or scales.
      • Algorithms rely on distances (e.g., k-NN, SVM, neural networks).
  2. Standardization (Z-score Scaling)
    • Equation:
      x' = (x - mean) / standard_deviation
    • Centers data around mean = 0 and standard deviation = 1.
    • Keeps outliers but reduces their impact.
    • Use when:
      • Data is roughly Gaussian.
      • Algorithms assume normally distributed data (e.g., linear regression, logistic regression, PCA).

Example:
If height ranges from 150–200 cm and weight from 40–120 kg:

  • Normalization → both features in [0,1].
  • Standardization → both centered around 0 with unit variance.
20
Q

Which of the following best describes the variance of a dataset?
A) The average of all data points
B) The square root of the mean
C) The average of the squared differences from the mean
D) The difference between the maximum and minimum values

A

C
Variance measures how much the data points spread out from the mean. It is calculated as the average of the squared differences between each data point and the mean.

21
Q

Question (True/False):
In machine learning, a high correlation between two features always implies that one feature causes the other.

A

False
Correlation does not imply causation.

Two features can be highly correlated due to a third hidden factor, coincidence, or data bias.

In ML, high correlation between features may indicate redundancy, which could affect model performance (e.g., multicollinearity in linear models).

22
Q

The central limit theorem states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately ________, regardless of the population’s distribution.

A

The sampling distribution of the sample mean will be approximately normally distributed, regardless of the population’s distribution.

Explanation:

  • The mean of this sampling distribution will equal the population mean.
  • This is important in ML and statistics for confidence intervals and hypothesis testing.
23
Q

Which of the following is an example of a probability distribution commonly used in machine learning?

A) Confusion matrix
B) Normal distribution
C) Decision tree
D) Gradient descent

A

B
Normal distribution is a continuous probability distribution widely used in ML for modeling data, assumptions in algorithms (like linear regression), and initializing weights in neural networks.

Other common distributions: Bernoulli, Binomial, Poisson, Exponential.

24
Q

What does the p-value mean, and why does a low p-value let us reject the null hypothesis (H0)?

A
  1. Definition reminder:
    • p-value = probability of observing the data (or something more extreme) if the null hypothesis H0 is true.
  2. Logic:
    • H0 assumes there is no effect or no difference.
    • If H0 is true, extreme results should be rare.
    • A low p-value means the observed data is very unlikely under H0.
  3. Decision rule:
    • If p-value < significance level (α, usually 0.05), the data is too unlikely under H0.
    • Therefore, we conclude: “H0 probably isn’t true,” and we reject H0.
  4. Important nuance:
    • Rejecting H0 does not prove H1; it just suggests H0 is inconsistent with the observed data.
    • P-value is a measure of evidence against H0, not the probability of H0 being true.

Analogy:

  • Suppose H0 says “the coin is fair.”
  • You flip it 100 times and get 95 heads.
  • If the coin were fair, getting 95 heads is extremely unlikely (very low p-value).
  • So you reject H0 and suspect the coin is biased.
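The coin analogy can be made concrete — a short sketch computing the exact one-sided p-value P(at least 95 heads | fair coin) from the binomial distribution:

```python
from math import comb

# Exact one-sided p-value: P(at least 95 heads in 100 flips | fair coin)
n = 100
p_value = sum(comb(n, k) for k in range(95, n + 1)) * 0.5 ** n

print(p_value)   # ≈ 6.3e-23 — far below any usual α, so reject H0 “the coin is fair”
```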
25
Q

Question (True/False): In statistics, the p-value represents the probability that the null hypothesis is true.

A

False
  • The p-value is calculated under the assumption that H0 is true; a low p-value indicates that the observed data is unlikely if H0 were true.
  • Example: Suppose H0: “Red candy is the most frequent in the bag.” You sample 50 candies: 30 blue, 10 orange, 10 red. The observed result (more blue than red) is very unlikely under H0, so the p-value is low.
  • Low p-value → reject H0 → evidence that red is not the most frequent.
26
Q

Question (Multiple Choice): Which of the following is an example of a descriptive statistic used in machine learning?
A) Linear regression coefficient
B) Mean and standard deviation of a dataset
C) p-value from hypothesis testing
D) Gradient of a loss function

A

B
  • Descriptive statistics summarize or describe data features, e.g., mean, median, mode, standard deviation, variance.
  • They do not infer or predict beyond the data.
  • In ML, descriptive stats are often used in exploratory data analysis (EDA) to understand feature distributions and scales.
27
Q

Question (Conceptual): What is the difference between a probability density function (PDF) and a cumulative distribution function (CDF)?

A
  • PDF (Probability Density Function):
    • Shows the relative likelihood of a random variable taking a specific value.
    • The area under the curve between two points gives the probability of the variable falling in that interval.
    • Example: In a normal distribution, the PDF is the familiar bell-shaped curve.
  • CDF (Cumulative Distribution Function):
    • Shows the probability that a random variable is less than or equal to a certain value.
    • It is the cumulative sum (integral) of the PDF.
    • Example: If X is a test score, CDF(70) gives the probability that a student scores 70 or less.

Key point: The PDF tells “how likely each value is”; the CDF tells “how likely the value is up to this point.”
28
Q

Question (True/False): In machine learning, standardizing features (subtract mean, divide by standard deviation) changes the shape of the data distribution.

A

False
  • Standardization does not change the shape; it only rescales the data so that mean = 0 and SD = 1.
  • If the original distribution is skewed, standardization does not make it normal; only the scale and location change.
  • Standardizing preserves the shape and the relationships between values, and helps distance- or gradient-based models (e.g., SVM, k-NN, gradient descent) converge faster and perform better.
29
Q

Question (Multiple Choice): Which of the following is a measure of the relationship between two continuous variables?
A) Variance
B) Covariance
C) Standard deviation
D) Correlation

A

B
  • Covariance measures how two variables change together:
    • Positive covariance → variables increase together
    • Negative covariance → one increases while the other decreases
  • Correlation is a standardized version of covariance, ranging from -1 to 1, making it easier to interpret.
  • In ML, covariance/correlation is used for feature analysis and reducing multicollinearity.
30
Q

Question (Fill in the Blank): In machine learning, the law of large numbers ensures that as the sample size increases, the sample mean will ________ the population mean.

A

Converge to (get close to) the population mean.
  • Importance in ML: Using more data generally improves estimates of statistics (mean, variance) and model reliability.
  • Helps justify why large datasets lead to better generalization.
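A quick simulation sketch of the law of large numbers (the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample means of fair-die rolls for increasing sample sizes; true mean = 3.5
means = {n: rng.integers(1, 7, size=n).mean() for n in (10, 1_000, 100_000)}

print(means)   # the estimate tightens around 3.5 as n grows
```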
31
Q

Question (Conceptual): What is the difference between a discrete and a continuous random variable? Give an example of each in machine learning.

A
  • Discrete random variable:
    • Takes countable values (often integers).
    • Example: Rolling a die → possible outcomes {1, 2, 3, 4, 5, 6}
    • In ML: number of clicks, number of items purchased.
  • Continuous random variable:
    • Takes any value within a range (uncountable, often real numbers).
    • Example: Height, weight, temperature
    • In ML: pixel intensity in images, sensor readings, regression targets.

Key point: Discrete = countable; continuous = measurable, any value in a range.
32
Q

Question (True/False): A skewed distribution affects the mean more than the median.

A

True
  • In a skewed distribution:
    • The mean is pulled toward the long tail (sensitive to extreme values).
    • The median is more robust and represents the middle value.
  • In ML, the median is often preferred as a robust statistic when data contains outliers.
33
Q

Question (Multiple Choice): Which of the following is an example of a categorical variable commonly used in machine learning?
A) Age of a person
B) Number of purchases
C) Type of fruit (apple, banana, orange)
D) Temperature in Celsius

A

C
  • Categorical variables represent qualitative data, e.g., types, labels, or categories.
  • In ML, categorical variables often need encoding (one-hot, label encoding) before feeding into models.
  • Examples: fruit type, gender, color, product category.
34
Q

Question (Conceptual): What is overdispersion in a dataset, and why is it important in modeling count data in machine learning?

A
  • Overdispersion: When the variance of a dataset is greater than the mean.
  • Common in count data (e.g., number of clicks, arrivals).
  • Importance: Standard models like Poisson regression assume mean ≈ variance. If overdispersion exists, Poisson may underestimate variance, leading to biased estimates and misleading significance tests.
  • Solutions: Use Negative Binomial regression or adjust standard errors.
35
Q

Question (Multiple Choice): In machine learning, which of the following scaling methods preserves the shape of the original data distribution?
A) Log transformation
B) Standardization (z-score)
C) Box-Cox transformation
D) Min-max normalization to [0,1]

A

B
  • Standardization (z-score) rescales data to have mean 0 and standard deviation 1 without changing the distribution shape.
  • Log and Box-Cox transformations change the shape to reduce skewness or make it closer to normal.
  • Min-max normalization rescales data but can compress or stretch the distribution, slightly affecting relative distances.
36
Q

Question (Fill in the Blank): In probability theory, the expected value of a random variable represents the ________ of its possible outcomes, weighted by their probabilities.

A

Weighted average.
  • The expected value is the weighted average of all possible outcomes of a random variable, using their probabilities as weights.
  • Example: Rolling a fair six-sided die: expected value = 1×1/6 + 2×1/6 + 3×1/6 + 4×1/6 + 5×1/6 + 6×1/6 = 3.5

Key point: It is the long-term average outcome if the experiment is repeated many times.
37
Q

Question (True/False): In a normal distribution, about 68% of the data lies within one standard deviation of the mean.

A

True
  • In a normal distribution (bell curve):
    • ~68% of data lies within ±1 standard deviation of the mean
    • ~95% within ±2 standard deviations
    • ~99.7% within ±3 standard deviations (the 68-95-99.7 rule)
38
Q

Question (Multiple Choice): Which statistical concept is most directly related to regularization techniques in machine learning (like L1 or L2)?
A) Variance
B) Correlation
C) Skewness
D) P-value

A

A
  • Regularization (L1/L2) controls model complexity to reduce variance and prevent overfitting.
  • High variance → model fits training data too closely but generalizes poorly.
  • Regularization adds a penalty term to shrink weights, balancing the bias-variance trade-off.
39
Q

Question (Conceptual): What is multicollinearity, and why can it be a problem in machine learning models like linear regression?

A
  • Multicollinearity: When two or more features are highly correlated.
  • Problems in linear regression:
    • Coefficients become unstable and sensitive to small changes in data.
    • Hard to interpret feature importance because effects are shared across correlated variables.
    • Inflates standard errors, making significance tests unreliable.
  • Solutions: Remove or reduce correlated features, use PCA, or apply regularization.
40
Q

Question (Fill in the Blank): The correlation coefficient ranges from ______ to ______ and measures the strength and direction of a linear relationship between two variables.

A

From -1 to 1.
  • +1: Perfect positive linear relationship
  • -1: Perfect negative linear relationship
  • 0: No linear relationship
  • In ML, correlation helps identify redundant features or relationships between variables.
41
Q

Question (Conceptual): What is the difference between parametric and non-parametric statistical models? Give an example of each used in machine learning.

A
  • Parametric models:
    • Assume a specific form for the underlying data distribution (fixed number of parameters).
    • Easier to interpret and faster to train, but may be wrong if the assumption is wrong.
    • Examples in ML: Linear regression, logistic regression, Gaussian Naive Bayes.
  • Non-parametric models:
    • Make no strict assumptions about the data distribution.
    • More flexible and can capture complex patterns, but require more data.
    • Examples in ML: k-Nearest Neighbors (k-NN), decision trees, kernel density estimation.
42
Q

Question (True/False): In hypothesis testing, increasing the sample size generally makes it easier to detect small effects.

A

True
  • Larger sample size → reduces standard error → makes estimates more precise.
  • This increases statistical power, meaning you are more likely to detect small but real effects.
  • Important in ML when evaluating feature significance or model improvements.
43
Q

Question (Multiple Choice): Which of the following is an example of a robust statistic that is less sensitive to outliers?
A) Mean
B) Median
C) Variance
D) Standard deviation

A

B
  • The median is the middle value of a dataset and is less affected by extreme values.
  • Other robust statistics include the interquartile range (IQR).
  • In ML, robust statistics are useful for feature scaling or summarizing skewed data with outliers.
44
Q

Question (Conceptual): What does independent and identically distributed (i.i.d.) mean, and why is this assumption important in machine learning?

A
  • Independent: Each data point does not depend on any other data point.
    • Example: Rolling a die multiple times; each roll is independent.
  • Identically distributed: All data points come from the same probability distribution.
    • Example: All rolls use the same fair die with the same probability for each face.
  • i.i.d. assumption in ML: Most algorithms assume that training and test data are i.i.d.
    • Ensures that patterns learned on training data generalize to unseen data.
    • Violations (e.g., time series data with temporal dependency) may require special handling.
  • Non-i.i.d. examples:
    • Stock prices (dependent on previous prices → not independent)
    • Time series of weather (distribution changes over seasons → not identically distributed)
    • Social network data (friends influence each other → not independent)
45
Q

Question (Fill in the Blank): In probability, Bayes’ theorem allows us to update the ________ of an event based on new evidence.

A

Probability (specifically, the posterior probability) of the event.
  • Bayes’ theorem: Posterior = (Likelihood × Prior) / Evidence
  • In ML, used in:
    • Naive Bayes classifiers
    • Bayesian inference for updating beliefs as new data arrives
  • Key idea: It combines prior knowledge with observed data to refine predictions.
46
Q

Question (True/False): A high p-value indicates strong evidence in favor of the null hypothesis.

A

False
  • A high p-value indicates insufficient evidence to reject the null hypothesis, but it does not prove H0 is true.
  • It just means the observed data is consistent with H0, or we don’t have strong evidence against it.
  • Always interpret p-values as evidence against H0, not as proof for H0.
47
Q

Question (Conceptual): What is the difference between probability and likelihood in statistics and machine learning?

A
  1. Probability: P(data | parameters)
    • “The probability of observing this data, given a model or parameters.”
    • Example: Coin is fair (p = 0.5). Probability of 3 heads in 5 flips = P(3 heads | p = 0.5).
  2. Likelihood: L(parameters | data)
    • “The likelihood of a set of parameters given the observed data.”
    • Example: You flip a coin 5 times and observe 3 heads. In the likelihood function L(p | 3 heads), p is variable and the data (3 heads in 5 flips) is fixed.
    • You can compute the likelihood for different values of p:
      • L(0.2) = probability of 3 heads in 5 flips if p = 0.2
      • L(0.5) = probability of 3 heads in 5 flips if p = 0.5
      • L(0.8) = probability of 3 heads in 5 flips if p = 0.8
  • Maximum Likelihood Estimation (MLE) chooses the value of p that maximizes L(p | data). In this example, MLE picks p = 0.6, because it maximizes the chance of observing 3 heads out of 5.

Key point: Probability treats parameters as fixed and data as variable; likelihood treats data as fixed and parameters as variable (used for estimation, e.g., MLE).
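The coin example as code — a minimal sketch evaluating the binomial likelihood over a grid of candidate p values (the grid resolution is an arbitrary choice):

```python
from math import comb

# Binomial likelihood of parameter p given fixed data: 3 heads in 5 flips
def likelihood(p, heads=3, flips=5):
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Data stays fixed while the parameter varies — scan a grid of candidate p values
grid = [i / 100 for i in range(101)]
p_mle = max(grid, key=likelihood)

print(p_mle)   # 0.6, matching the observed frequency 3/5
```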
48
Q

Question (Multiple Choice): Which of the following distributions is often assumed for the errors/residuals in linear regression?
A) Uniform distribution
B) Normal (Gaussian) distribution
C) Exponential distribution

A

B
Example: Linear regression on house prices
  • Model: Predict house price based on size.
  • Residuals: Difference between actual price and predicted price.

| House Size | Actual Price | Predicted Price | Residual |
| ---------- | ------------ | --------------- | -------- |
| 100 m²     | 200k         | 195k            | 5k       |
| 120 m²     | 250k         | 248k            | 2k       |
| 90 m²      | 180k         | 182k            | -2k      |
| 110 m²     | 210k         | 215k            | -5k      |

  • If you plot the residuals, they should roughly follow a normal distribution centered around 0.
  • This satisfies the linear regression assumption and allows statistical inference (like t-tests for coefficients).
49
Q

Question (True/False): In machine learning, a larger variance in the features always improves model performance.

A

False
  • Large variance in features does not always improve performance.
  • High variance can:
    • Cause numerical instability in algorithms like gradient descent
    • Lead to dominance of features with larger scales
  • That’s why feature scaling (standardization or normalization) is often necessary.
50
Q

Question (Conceptual): What is the difference between sample variance and population variance, and why do we divide by (n-1) instead of n when computing sample variance?

A
  • Population variance: Measures the variance of the entire population; divide by N (total population size).
  • Sample variance: Measures the variance of a subset (sample) of the population; divide by n-1 instead of n.
  • Why n-1:
    • Using n-1 corrects for bias in estimating the population variance from a sample (Bessel’s correction).
    • Intuition: The sample mean is itself estimated from the data, so we lose one degree of freedom; dividing by n-1 compensates.

Key point: Dividing by n-1 makes the sample variance an unbiased estimator of the population variance.
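NumPy exposes this choice directly through the `ddof` argument — a minimal sketch with toy data:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

pop_var = np.var(x)               # divides by n   (ddof=0, the default)
sample_var = np.var(x, ddof=1)    # divides by n-1 (Bessel’s correction)

print(pop_var, sample_var)   # 5.0 vs 6.666…
```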
51
Q

Question (Multiple Choice): Which of the following is a non-parametric test commonly used to compare two independent samples?
A) Student’s t-test
B) ANOVA
C) Mann-Whitney U test
D) Linear regression

A

C
  • The Mann-Whitney U test compares two independent samples without assuming normality.
  • Student’s t-test and ANOVA are parametric tests; linear regression is a modeling method, not a two-sample test.

Related fact: In a Poisson distribution, the mean and variance are equal.
  • Poisson models count data (e.g., number of arrivals, clicks, events in a fixed interval).
  • If λ is the average rate: Mean = λ and Variance = λ.
  • If the observed variance is much larger than the mean, this indicates overdispersion, and Poisson may not be suitable.
52
Q

Question (Conceptual): What is the difference between parametric estimation and maximum likelihood estimation (MLE) in statistics?