Statistics Flashcards

(52 cards)

1
Q

What is variance and how does it affect model performance?

A
  • Variance measures how much a set of values deviates from their mean.
  • In ML, model variance refers to how much a model’s predictions change with different training data.
  • High variance: Model fits training data very closely → overfitting, poor generalization on unseen data.
  • Low variance: Model predictions are stable across datasets but may underfit if too simple.

Key point: Variance is central to the bias-variance tradeoff:

> High variance → overfitting; High bias → underfitting. Random Forest reduces variance via averaging multiple trees.

2
Q

What is standard deviation, and how is it related to variance? How does it help in ML?

A
  • Standard deviation (SD) is the square root of variance, giving a measure of spread in the same units as the data.
    SD = sqrt( Σ(x_i - mean(x))² / n )
  • Relationship: Variance quantifies spread in squared units; SD makes it interpretable in original units.
  • Use in ML: SD helps detect outliers, normalize features, and understand feature dispersion.

Key point: SD and variance are fundamental for scaling, normalization, and understanding model input distribution.
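The relationship is easy to check numerically — a minimal NumPy sketch (the data values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # toy data

variance = np.mean((x - x.mean()) ** 2)   # spread in squared units
sd = np.sqrt(variance)                    # same units as the data

print(variance, sd)   # 4.0 and 2.0
```

Note that `sd` matches `np.std(x)`, which computes the same square root of the mean squared deviation.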

3
Q

What is covariance, and how is it different from correlation?

A

Covariance: Measures how two variables vary together.

Cov(X, Y) = (1/n) * Σ[(X_i - mean(X)) * (Y_i - mean(Y))]
  • Positive → X and Y increase together
  • Negative → One increases while the other decreases
  • Zero → No linear relationship

Correlation: Normalized covariance, ranges [-1, 1]

r = Cov(X, Y) / (SD(X) * SD(Y))
  • Unitless → easier to compare relationships
  • Covariance depends on units; correlation does not

Key point: Covariance shows joint variability; correlation shows strength and direction of linear relationship.
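A small NumPy sketch contrasting the two quantities (toy data assumed):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # toy data: Y = 2X exactly
Y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

cov = np.mean((X - X.mean()) * (Y - Y.mean()))   # population covariance
r = cov / (X.std() * Y.std())                    # unitless, in [-1, 1]

print(cov, r)   # 4.0 and ≈1.0 — a perfect positive linear relationship
```

Rescaling Y (say, to different units) would change `cov` but leave `r` at 1, which is why correlation is easier to compare across feature pairs.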

4
Q

What is a probability distribution, and why is it important in ML?

A
  • Probability distribution: Describes how values of a random variable are spread and their likelihoods.
  • Types commonly used in ML:
    • Discrete: e.g., Bernoulli, Binomial
    • Continuous: e.g., Normal (Gaussian), Uniform, Exponential
  • Importance in ML:
    • Helps model data, make predictions, and estimate uncertainty.
    • Foundation for statistical inference, Bayesian methods, and loss function design (e.g., MSE assumes Gaussian errors).

Key point: Understanding the data distribution is essential for choosing models, preprocessing, and evaluating uncertainty.

5
Q

What is the difference between a population and a sample?

A
  • Population: The entire set of items or observations of interest.
  • Sample: A subset of the population used to make inferences.

Why it matters in ML:

  • Models are trained on samples, but we aim to generalize to the population.
  • Sampling introduces variance and bias, which must be considered in model evaluation.

Key point: Understanding the difference helps in estimating errors, confidence intervals, and generalization.

6
Q

Define bias and variance in machine learning.

A
  • Bias: Error due to wrong assumptions in the model (underfitting).
    • High bias → model too simple, poor training & test performance.
  • Variance: Error due to sensitivity to training data (overfitting).
    • High variance → model performs well on training but poorly on test data.

Bias-Variance Tradeoff:

  • Total error = Bias² + Variance + Irreducible error
  • Goal: balance bias and variance for best generalization.

Key point: Random Forest reduces variance via averaging; more flexible models (e.g., deep networks) reduce bias.

7
Q

What is skewness and kurtosis in data?

A
  • Skewness: Measures asymmetry of a data distribution.
    • Positive skew → tail on right
    • Negative skew → tail on left
    • Zero skew → symmetric distribution
  • Kurtosis: Measures “tailedness” or how heavy/extreme the tails are.
    • High kurtosis → heavy tails, more outliers
    • Low kurtosis → light tails, fewer extreme values
    • Normal distribution kurtosis ≈ 3

Importance in ML:

  • Skewed features may need transformation for models assuming normality.
  • High kurtosis can affect robustness and loss functions.

Key point: Skewness and kurtosis help understand distribution shape, outliers, and preprocessing needs.

8
Q

What is a z-score, and why is it useful?

A
  • Definition: Measures how many standard deviations a data point is from the mean.
z = (x - mean) / SD
  • Interpretation:
    • z = 0 → data point equals the mean
    • z > 0 → above the mean
    • z < 0 → below the mean
  • Use in ML:
    • Standardization of features
    • Detecting outliers
    • Useful in distance-based algorithms (k-NN, clustering)

Key point: Z-scores normalize data for better model performance and comparability.
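A quick NumPy sketch (toy feature values assumed) showing that z-scoring yields mean 0 and SD 1:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 18.0])   # toy feature

z = (x - x.mean()) / x.std()   # how many SDs each point is from the mean

print(z)                   # the middle point (= the mean) maps to z = 0
print(z.mean(), z.std())   # ≈ 0 and 1 after standardization
```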

9
Q

Explain the Central Limit Theorem (CLT) in a simple way.

A
  • Idea: If you take many random samples from any population and calculate the average of each sample, those averages will form a normal distribution as the number of samples grows.
  • Simple Example:
    1. Take a die (numbers 1–6). A single roll is not normally distributed; it’s uniform.
    2. Roll it 5 times, calculate the average. Repeat this many times.
    3. Plot all averages → the shape will be bell-shaped (normal).
  • Why it matters in ML:
    • Helps us assume normality for sample means, even if data isn’t normal.
    • Supports confidence intervals, hypothesis tests, and robust model evaluation.

Key point: CLT explains why averages of random samples tend to be predictable and normally distributed, which is very useful for ML statistics.
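The dice experiment above can be simulated directly — a small NumPy sketch (the seed and sample counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Roll a fair die 5 times, average the rolls, and repeat 10,000 times
rolls = rng.integers(1, 7, size=(10_000, 5))
sample_means = rolls.mean(axis=1)

# Even though a single roll is uniform (not normal), the averages cluster
# around the true mean 3.5 in a bell-shaped histogram
print(sample_means.mean(), sample_means.std())
```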

10
Q

What is the Central Limit Theorem (CLT) and why is it important in ML?

A
  • Central Limit Theorem: The sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the original population distribution.
  • Importance in ML:
    1. Justifies assuming normality for large sample sizes.
    2. Supports statistical inference, confidence intervals, and hypothesis testing.
    3. Helps in designing robust estimators and understanding sampling variability.

Key point: CLT enables ML practitioners to apply probabilistic reasoning and statistical tests even on non-normal data, as long as sample size is large.

11
Q

What is an outlier, and why is it important in ML?

A
  • Outlier: A data point that is significantly different from the majority of the data.
  • Causes: Measurement error, rare events, or natural variability.
  • Impact on ML:
    • Can distort mean, variance, and model parameters.
    • Can affect distance-based algorithms (k-NN, clustering).
    • Can lead to overfitting in models sensitive to extreme values.
  • Detection Methods:
    • Z-score (e.g., |z| > 3)
    • IQR method (1.5 × IQR beyond Q1/Q3)
    • Visualizations: boxplots, scatterplots
  • Handling: Remove, cap, or transform outliers depending on context.

Key point: Identifying and handling outliers improves model accuracy and robustness.
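Both detection rules can be sketched in a few lines of NumPy (toy data with one planted outlier; the z cutoff is lowered to 2.5 here because a single outlier in a sample of only 10 points cannot push |z| much above 3):

```python
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102.0])   # 102 planted as an outlier

# Z-score rule (the usual cutoff is |z| > 3; 2.5 is used for this tiny sample)
z = (x - x.mean()) / x.std()
outliers_z = x[np.abs(z) > 2.5]

# IQR rule: flag points beyond 1.5 × IQR from Q1/Q3
q1, q3 = np.percentile(x, [25, 75])
fence_low, fence_high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers_iqr = x[(x < fence_low) | (x > fence_high)]

print(outliers_z, outliers_iqr)   # both rules flag only 102.0
```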

12
Q

Why does applying log or other transformations not change the relationship between features and target?

A
  • Core idea: Transformations rescale or compress data but preserve the order of values.
    • E.g., if house A > house B in size, after log: log(A) > log(B) → ranking stays the same.
  • Effect on relationships:
    • Linear relationships in transformed space may approximate non-linear relationships in original space.
    • Helps linear models fit patterns better, but the underlying correlation is preserved.
  • Example:
    • Original sizes: 50, 100, 200, 1000 → target prices
    • Log sizes: log(50), log(100), log(200), log(1000) → preserves order
    • Model now sees more evenly spaced data, less dominated by outliers

Key point: Transformations improve model fit and reduce skewness without reversing or breaking relationships between variables.
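The order-preservation claim is easy to verify — a minimal NumPy sketch using the sizes from the example:

```python
import numpy as np

sizes = np.array([50.0, 100.0, 200.0, 1000.0])   # sizes from the example above
log_sizes = np.log(sizes)

# Ranking is unchanged: log is strictly increasing
same_order = np.argsort(sizes).tolist() == np.argsort(log_sizes).tolist()

# But spacing becomes far more even, so 1000 no longer dominates
print(same_order)             # True
print(np.diff(sizes))         # [ 50. 100. 800.]
print(np.diff(log_sizes))     # ≈ [0.69 0.69 1.61]
```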

13
Q

What are residuals in regression, and why are they important?

A
  • Definition: Residual = difference between the actual value and the predicted value by the model.
residual = actual_value - predicted_value
  • Purpose:
    1. Measures model error for each observation
    2. Helps diagnose model fit — patterns in residuals indicate problems
    3. Assumptions in linear regression:
      • Residuals should be normally distributed
      • Mean = 0
      • Homoscedasticity (constant variance)

Key point: Residual analysis is essential for evaluating model performance and assumptions, especially in linear models.
14
Q

What are the assumptions of linear regression?

A
  1. Linearity: Relationship between features and target is linear.
  2. Independence: Observations are independent of each other.
  3. Homoscedasticity: Residuals have constant variance across all predicted values.
  4. Normality of residuals: Residuals are approximately normally distributed.
  5. No multicollinearity: Features are not highly correlated with each other.
  6. No autocorrelation: Residuals should not be correlated (especially in time series).

Key point: Violating these assumptions can lead to biased, inefficient, or invalid estimates, so diagnostic checks and preprocessing are essential.

15
Q

Why is multicollinearity a problem in linear regression? Give an example.

A
  • Problem: When two or more features are highly correlated, the model cannot distinguish their individual effects on the target.
  • Effects:
    1. Unstable coefficients → small changes in data cause large swings in estimates
    2. High standard errors → low statistical significance
    3. Hard to interpret which feature truly impacts the target
  • Example:
    • Predicting house price using:
      • Feature 1: house_size (m²)
      • Feature 2: number_of_rooms
    • These are highly correlated → model cannot tell which feature contributes more to price.
    • Coefficients may flip signs or vary greatly if dataset changes slightly.

Key point: Multicollinearity does not reduce predictive power much for tree-based models, but for linear regression it destabilizes coefficients and interpretation.

16
Q

You notice two features in your dataset are highly correlated. Why is this a problem for linear regression, and how can you fix it?

A
  • Problem: Multicollinearity
    • Makes coefficient estimates unstable and hard to interpret
    • Increases standard errors, reducing statistical significance
  • Solutions:
    1. Remove one of the correlated features
    2. Combine features (e.g., via PCA)
    3. Regularization (Ridge or Lasso regression) to reduce impact of collinearity

Key point: Handling multicollinearity improves model stability, interpretability, and prediction accuracy.
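The Ridge fix can be illustrated with a NumPy-only sketch using the closed-form solutions (the synthetic data, noise levels, and α are arbitrary choices): with two nearly duplicate features, plain least squares produces unstable coefficients, while the ridge penalty shrinks and stabilizes them.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

size = rng.normal(100, 20, n)                 # house size in m² (synthetic)
rooms = size / 25 + rng.normal(0, 0.1, n)     # nearly a copy of size → collinear
y = 3.0 * size + rng.normal(0, 5, n)          # target truly depends on size only

X = np.column_stack([size, rooms])

# Ordinary least squares: w = (XᵀX)⁻¹Xᵀy — ill-conditioned under collinearity
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (XᵀX + αI)⁻¹Xᵀy — the penalty stabilizes the solution
alpha = 10.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print(w_ols, w_ridge)   # ridge coefficients have a smaller norm, same good fit
```

Ridge never increases the coefficient norm relative to OLS, and here predictions stay essentially as accurate because the shrinkage only suppresses the near-redundant direction.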

17
Q

What is feature scaling, and why is it important in ML?

A
  • Feature scaling: Process of normalizing or standardizing feature values to a common scale.
  • Common methods:
    1. Min-Max Scaling: Scales features to [0, 1]
      X_scaled = (X - X_min) / (X_max - X_min)
    2. Standardization (Z-score): Mean 0, SD 1
      X_scaled = (X - mean(X)) / SD(X)
  • Importance:
    • Algorithms sensitive to feature scale: gradient descent, k-NN, SVM, neural networks
    • Prevents dominance of large-scale features over small-scale features
    • Improves convergence speed and stability

Key point: Scaling ensures features contribute equally to model learning and improves training efficiency.
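Both methods in a short NumPy sketch (toy height values assumed):

```python
import numpy as np

x = np.array([150.0, 160.0, 170.0, 180.0, 200.0])   # e.g. heights in cm (toy data)

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max → [0, 1]
x_std = (x - x.mean()) / x.std()                 # z-score → mean 0, SD 1

print(x_minmax)                    # [0.  0.2 0.4 0.6 1. ]
print(x_std.mean(), x_std.std())   # ≈ 0 and 1
```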

18
Q

Why are some algorithms (gradient descent, k-NN, SVM, neural networks) sensitive to feature scale?

A
  1. Gradient Descent (used in linear/logistic regression, neural networks):
    • Gradient updates depend on feature values.
    • Large-scale features produce large gradients, small-scale features produce small gradients → optimization becomes unbalanced, slows convergence, may oscillate.
    • Scaling ensures equal contribution and faster, stable convergence.
  2. k-Nearest Neighbors (k-NN):
    • Distance-based algorithm (Euclidean, Manhattan).
    • Features with larger scales dominate distance computation, small-scale features are ignored.
    • Scaling ensures all features equally affect distance.
  3. Support Vector Machines (SVM):
    • SVM finds the maximum margin hyperplane using distances.
    • Features with larger scales skew distances → hyperplane biased toward large-scale features.
    • Scaling ensures fair margin calculation.
  4. Neural Networks:
    • Inputs with large variance can cause activation outputs to saturate (sigmoid/tanh) or slow learning (gradient issues).
    • Standardized inputs improve learning speed and stability.

Key point: Feature scaling prevents domination by large-scale features, ensures balanced learning, and improves model convergence.

19
Q

What’s the difference between normalization and standardization, and when should we use each?

A
  1. Normalization (Min–Max Scaling)
    • Equation:
      x' = (x - x_min) / (x_max - x_min)
    • Scales values to a fixed range [0, 1].
    • Preserves the shape of the distribution but changes its scale.
    • Use when:
      • Features have different units or scales.
      • Algorithms rely on distances (e.g., k-NN, SVM, neural networks).
  2. Standardization (Z-score Scaling)
    • Equation:
      x' = (x - mean) / standard_deviation
    • Centers data around mean = 0 and standard deviation = 1.
    • Keeps outliers but reduces their impact.
    • Use when:
      • Data is roughly Gaussian.
      • Algorithms assume normally distributed data (e.g., linear regression, logistic regression, PCA).

Example:
If height ranges from 150–200 cm and weight from 40–120 kg:

  • Normalization → both features in [0,1].
  • Standardization → both centered around 0 with unit variance.
20
Q

Which of the following best describes the variance of a dataset?
A) The average of all data points
B) The square root of the mean
C) The average of the squared differences from the mean
D) The difference between the maximum and minimum values

A

C
Variance measures how much the data points spread out from the mean. It is calculated as the average of the squared differences between each data point and the mean.

21
Q

Question (True/False):
In machine learning, a high correlation between two features always implies that one feature causes the other.

A

False
Correlation does not imply causation.

Two features can be highly correlated due to a third hidden factor, coincidence, or data bias.

In ML, high correlation between features may indicate redundancy, which could affect model performance (e.g., multicollinearity in linear models).

22
Q

The central limit theorem states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately ________, regardless of the population’s distribution.

A

The sampling distribution of the sample mean will be approximately normally distributed, regardless of the population’s distribution.

Explanation:

  • The mean of this sampling distribution will equal the population mean.
  • This is important in ML and statistics for confidence intervals and hypothesis testing.
23
Q

Which of the following is an example of a probability distribution commonly used in machine learning?

A) Confusion matrix
B) Normal distribution
C) Decision tree
D) Gradient descent

A

B
Normal distribution is a continuous probability distribution widely used in ML for modeling data, assumptions in algorithms (like linear regression), and initializing weights in neural networks.

Other common distributions: Bernoulli, Binomial, Poisson, Exponential.

24
Q

What does the p-value mean, and why does a low p-value let us reject the null hypothesis (H0)?

A
  1. Definition reminder:
    • p-value = probability of observing the data (or something more extreme) if the null hypothesis H0 is true.
  2. Logic:
    • H0 assumes there is no effect or no difference.
    • If H0 is true, extreme results should be rare.
    • A low p-value means the observed data is very unlikely under H0.
  3. Decision rule:
    • If p-value < significance level (α, usually 0.05), the data is too unlikely under H0.
    • Therefore, we conclude: “H0 probably isn’t true,” and we reject H0.
  4. Important nuance:
    • Rejecting H0 does not prove H1; it just suggests H0 is inconsistent with the observed data.
    • P-value is a measure of evidence against H0, not the probability of H0 being true.

Analogy:

  • Suppose H0 says “the coin is fair.”
  • You flip it 100 times and get 95 heads.
  • If the coin were fair, getting 95 heads is extremely unlikely (very low p-value).
  • So you reject H0 and suspect the coin is biased.
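The coin analogy can be made concrete — a short sketch computing the exact one-sided p-value P(at least 95 heads | fair coin) from the binomial distribution:

```python
from math import comb

# Exact one-sided p-value: P(at least 95 heads in 100 flips | fair coin)
n = 100
p_value = sum(comb(n, k) for k in range(95, n + 1)) * 0.5 ** n

print(p_value)   # ≈ 6.3e-23 — far below any usual α, so reject H0 “the coin is fair”
```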
25
Q

Question (True/False): In statistics, the p-value represents the probability that the null hypothesis is true.

A

False
  • The p-value is calculated under the assumption that H0 is true; a low p-value indicates that the observed data is unlikely if H0 were true.
  • Example: Suppose H0: “Red candy is the most frequent in the bag.” You sample 50 candies: 30 blue, 10 orange, 10 red. The observed result (more blue than red) is very unlikely under H0, so the p-value is low.
  • Low p-value → reject H0 → evidence that red is not the most frequent.
26
Q

Question (Multiple Choice): Which of the following is an example of a descriptive statistic used in machine learning?
A) Linear regression coefficient
B) Mean and standard deviation of a dataset
C) p-value from hypothesis testing
D) Gradient of a loss function

A

B
  • Descriptive statistics summarize or describe data features, e.g., mean, median, mode, standard deviation, variance.
  • They do not infer or predict beyond the data.
  • In ML, descriptive stats are often used in exploratory data analysis (EDA) to understand feature distributions and scales.
27
Q

Question (Conceptual): What is the difference between a probability density function (PDF) and a cumulative distribution function (CDF)?

A
  • PDF (Probability Density Function):
    • Shows the relative likelihood of a random variable taking a specific value.
    • The area under the curve between two points gives the probability of the variable falling in that interval.
    • Example: In a normal distribution, the PDF is the familiar bell-shaped curve.
  • CDF (Cumulative Distribution Function):
    • Shows the probability that a random variable is less than or equal to a certain value.
    • It is the cumulative sum (integral) of the PDF.
    • Example: If X is a test score, CDF(70) gives the probability that a student scores 70 or less.

Key point: The PDF tells “how likely each value is”; the CDF tells “how likely the value is up to this point.”
28
Q

Question (True/False): In machine learning, standardizing features (subtract mean, divide by standard deviation) changes the shape of the data distribution.

A

False
  • Standardization does not change the shape; it only rescales the data so that mean = 0 and SD = 1.
  • If the original distribution is skewed, standardization does not make it normal; only the scale and location change.
  • Standardizing preserves the shape and the relationships between values, and helps distance- or gradient-based models (e.g., SVM, k-NN, gradient descent) converge faster and perform better.
29
Q

Question (Multiple Choice): Which of the following is a measure of the relationship between two continuous variables?
A) Variance
B) Covariance
C) Standard deviation
D) Correlation

A

B
  • Covariance measures how two variables change together:
    • Positive covariance → variables increase together
    • Negative covariance → one increases while the other decreases
  • Correlation is a standardized version of covariance, ranging from -1 to 1, making it easier to interpret.
  • In ML, covariance/correlation is used for feature analysis and reducing multicollinearity.
30
Q

Question (Fill in the Blank): In machine learning, the law of large numbers ensures that as the sample size increases, the sample mean will ________ the population mean.

A

Converge to (get close to) the population mean.
  • Importance in ML: Using more data generally improves estimates of statistics (mean, variance) and model reliability.
  • Helps justify why large datasets lead to better generalization.
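A quick simulation sketch of the law of large numbers (the seed and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sample means of fair-die rolls for increasing sample sizes; true mean = 3.5
means = {n: rng.integers(1, 7, size=n).mean() for n in (10, 1_000, 100_000)}

print(means)   # the estimate tightens around 3.5 as n grows
```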
31
Q

Question (Conceptual): What is the difference between a discrete and a continuous random variable? Give an example of each in machine learning.

A
  • Discrete random variable:
    • Takes countable values (often integers).
    • Example: Rolling a die → possible outcomes {1, 2, 3, 4, 5, 6}
    • In ML: number of clicks, number of items purchased.
  • Continuous random variable:
    • Takes any value within a range (uncountable, often real numbers).
    • Example: Height, weight, temperature
    • In ML: pixel intensity in images, sensor readings, regression targets.

Key point: Discrete = countable; continuous = measurable, any value in a range.
32
Q

Question (True/False): A skewed distribution affects the mean more than the median.

A

True
  • In a skewed distribution:
    • The mean is pulled toward the long tail (sensitive to extreme values).
    • The median is more robust and represents the middle value.
  • In ML, the median is often preferred as a robust statistic when data contains outliers.
33
Q

Question (Multiple Choice): Which of the following is an example of a categorical variable commonly used in machine learning?
A) Age of a person
B) Number of purchases
C) Type of fruit (apple, banana, orange)
D) Temperature in Celsius

A

C
  • Categorical variables represent qualitative data, e.g., types, labels, or categories.
  • In ML, categorical variables often need encoding (one-hot, label encoding) before feeding into models.
  • Examples: fruit type, gender, color, product category.
34
Q

Question (Conceptual): What is overdispersion in a dataset, and why is it important in modeling count data in machine learning?

A
  • Overdispersion: When the variance of a dataset is greater than the mean.
  • Common in count data (e.g., number of clicks, arrivals).
  • Importance: Standard models like Poisson regression assume mean ≈ variance. If overdispersion exists, Poisson may underestimate variance, leading to biased estimates and misleading significance tests.
  • Solutions: Use Negative Binomial regression or adjust standard errors.
35
Q

Question (Multiple Choice): In machine learning, which of the following scaling methods preserves the shape of the original data distribution?
A) Log transformation
B) Standardization (z-score)
C) Box-Cox transformation
D) Min-max normalization to [0,1]

A

B
  • Standardization (z-score) rescales data to have mean 0 and standard deviation 1 without changing the distribution shape.
  • Log and Box-Cox transformations change the shape to reduce skewness or make it closer to normal.
  • Min-max normalization rescales data but can compress or stretch the distribution, slightly affecting relative distances.
36
Q

Question (Fill in the Blank): In probability theory, the expected value of a random variable represents the ________ of its possible outcomes, weighted by their probabilities.

A

Weighted average.
  • The expected value is the weighted average of all possible outcomes of a random variable, using their probabilities as weights.
  • Example: Rolling a fair six-sided die: expected value = 1×1/6 + 2×1/6 + 3×1/6 + 4×1/6 + 5×1/6 + 6×1/6 = 3.5

Key point: It is the long-term average outcome if the experiment is repeated many times.
37
Q

Question (True/False): In a normal distribution, about 68% of the data lies within one standard deviation of the mean.

A

True
  • In a normal distribution (bell curve):
    • ~68% of data lies within ±1 standard deviation of the mean
    • ~95% within ±2 standard deviations
    • ~99.7% within ±3 standard deviations (the 68-95-99.7 rule)
38
Q

Question (Multiple Choice): Which statistical concept is most directly related to regularization techniques in machine learning (like L1 or L2)?
A) Variance
B) Correlation
C) Skewness
D) P-value

A

A
  • Regularization (L1/L2) controls model complexity to reduce variance and prevent overfitting.
  • High variance → model fits training data too closely but generalizes poorly.
  • Regularization adds a penalty term to shrink weights, balancing the bias-variance trade-off.
39
Q

Question (Conceptual): What is multicollinearity, and why can it be a problem in machine learning models like linear regression?

A
  • Multicollinearity: When two or more features are highly correlated.
  • Problems in linear regression:
    • Coefficients become unstable and sensitive to small changes in data.
    • Hard to interpret feature importance because effects are shared across correlated variables.
    • Inflates standard errors, making significance tests unreliable.
  • Solutions: Remove or reduce correlated features, use PCA, or apply regularization.
40
Q

Question (Fill in the Blank): The correlation coefficient ranges from ______ to ______ and measures the strength and direction of a linear relationship between two variables.

A

From -1 to 1.
  • +1: Perfect positive linear relationship
  • -1: Perfect negative linear relationship
  • 0: No linear relationship
  • In ML, correlation helps identify redundant features or relationships between variables.
41
Q

Question (Conceptual): What is the difference between parametric and non-parametric statistical models? Give an example of each used in machine learning.

A
  • Parametric models:
    • Assume a specific form for the underlying data distribution (fixed number of parameters).
    • Easier to interpret and faster to train, but may be wrong if the assumption is wrong.
    • Examples in ML: Linear regression, logistic regression, Gaussian Naive Bayes.
  • Non-parametric models:
    • Make no strict assumptions about the data distribution.
    • More flexible and can capture complex patterns, but require more data.
    • Examples in ML: k-Nearest Neighbors (k-NN), decision trees, kernel density estimation.
42
Q

Question (True/False): In hypothesis testing, increasing the sample size generally makes it easier to detect small effects.

A

True
  • Larger sample size → reduces standard error → makes estimates more precise.
  • This increases statistical power, meaning you are more likely to detect small but real effects.
  • Important in ML when evaluating feature significance or model improvements.
43
Q

Question (Multiple Choice): Which of the following is an example of a robust statistic that is less sensitive to outliers?
A) Mean
B) Median
C) Variance
D) Standard deviation

A

B
  • The median is the middle value of a dataset and is less affected by extreme values.
  • Other robust statistics include the interquartile range (IQR).
  • In ML, robust statistics are useful for feature scaling or summarizing skewed data with outliers.
44
Q

Question (Conceptual): What does independent and identically distributed (i.i.d.) mean, and why is this assumption important in machine learning?

A
  • Independent: Each data point does not depend on any other data point.
    • Example: Rolling a die multiple times; each roll is independent.
  • Identically distributed: All data points come from the same probability distribution.
    • Example: All rolls use the same fair die with the same probability for each face.
  • i.i.d. assumption in ML: Most algorithms assume that training and test data are i.i.d.
    • Ensures that patterns learned on training data generalize to unseen data.
    • Violations (e.g., time series data with temporal dependency) may require special handling.
  • Non-i.i.d. examples:
    • Stock prices (dependent on previous prices → not independent)
    • Time series of weather (distribution changes over seasons → not identically distributed)
    • Social network data (friends influence each other → not independent)
45
Q

Question (Fill in the Blank): In probability, Bayes’ theorem allows us to update the ________ of an event based on new evidence.

A

Probability (specifically, the posterior probability) of the event.
  • Bayes’ theorem: Posterior = (Likelihood × Prior) / Evidence
  • In ML, used in:
    • Naive Bayes classifiers
    • Bayesian inference for updating beliefs as new data arrives
  • Key idea: It combines prior knowledge with observed data to refine predictions.
46
Q

Question (True/False): A high p-value indicates strong evidence in favor of the null hypothesis.

A

False
  • A high p-value indicates insufficient evidence to reject the null hypothesis, but it does not prove H0 is true.
  • It just means the observed data is consistent with H0, or we don’t have strong evidence against it.
  • Always interpret p-values as evidence against H0, not as proof for H0.
47
Q

Question (Conceptual): What is the difference between probability and likelihood in statistics and machine learning?

A
  1. Probability: P(data | parameters)
    • “The probability of observing this data, given a model or parameters.”
    • Example: Coin is fair (p = 0.5). Probability of 3 heads in 5 flips = P(3 heads | p = 0.5).
  2. Likelihood: L(parameters | data)
    • “The likelihood of a set of parameters given the observed data.”
    • Example: You flip a coin 5 times and observe 3 heads. In the likelihood function L(p | 3 heads), p is variable and the data (3 heads in 5 flips) is fixed.
    • You can compute the likelihood for different values of p:
      • L(0.2) = probability of 3 heads in 5 flips if p = 0.2
      • L(0.5) = probability of 3 heads in 5 flips if p = 0.5
      • L(0.8) = probability of 3 heads in 5 flips if p = 0.8
  • Maximum Likelihood Estimation (MLE) chooses the value of p that maximizes L(p | data). In this example, MLE picks p = 0.6, because it maximizes the chance of observing 3 heads out of 5.

Key point: Probability treats parameters as fixed and data as variable; likelihood treats data as fixed and parameters as variable (used for estimation, e.g., MLE).
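The coin example as code — a minimal sketch evaluating the binomial likelihood over a grid of candidate p values (the grid resolution is an arbitrary choice):

```python
from math import comb

# Binomial likelihood of parameter p given fixed data: 3 heads in 5 flips
def likelihood(p, heads=3, flips=5):
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Data stays fixed while the parameter varies — scan a grid of candidate p values
grid = [i / 100 for i in range(101)]
p_mle = max(grid, key=likelihood)

print(p_mle)   # 0.6, matching the observed frequency 3/5
```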
48
Q

Question (Multiple Choice): Which of the following distributions is often assumed for the errors/residuals in linear regression?
A) Uniform distribution
B) Normal (Gaussian) distribution
C) Exponential distribution

A

B
Example: Linear regression on house prices
  • Model: Predict house price based on size.
  • Residuals: Difference between actual price and predicted price.

| House Size | Actual Price | Predicted Price | Residual |
| ---------- | ------------ | --------------- | -------- |
| 100 m²     | 200k         | 195k            | 5k       |
| 120 m²     | 250k         | 248k            | 2k       |
| 90 m²      | 180k         | 182k            | -2k      |
| 110 m²     | 210k         | 215k            | -5k      |

  • If you plot the residuals, they should roughly follow a normal distribution centered around 0.
  • This satisfies the linear regression assumption and allows statistical inference (like t-tests for coefficients).
49
Q

Question (True/False): In machine learning, a larger variance in the features always improves model performance.

A

False
  • Large variance in features does not always improve performance.
  • High variance can:
    • Cause numerical instability in algorithms like gradient descent
    • Lead to dominance of features with larger scales
  • That’s why feature scaling (standardization or normalization) is often necessary.
50
Q

Question (Conceptual): What is the difference between sample variance and population variance, and why do we divide by (n-1) instead of n when computing sample variance?

A
  • Population variance: Measures the variance of the entire population; divide by N (total population size).
  • Sample variance: Measures the variance of a subset (sample) of the population; divide by n-1 instead of n.
  • Why n-1:
    • Using n-1 corrects for bias in estimating the population variance from a sample (Bessel’s correction).
    • Intuition: The sample mean is itself estimated from the data, so we lose one degree of freedom; dividing by n-1 compensates.

Key point: Dividing by n-1 makes the sample variance an unbiased estimator of the population variance.
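NumPy exposes this choice directly through the `ddof` argument — a minimal sketch with toy data:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

pop_var = np.var(x)               # divides by n   (ddof=0, the default)
sample_var = np.var(x, ddof=1)    # divides by n-1 (Bessel’s correction)

print(pop_var, sample_var)   # 5.0 vs 6.666…
```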
51
Q

Question (Multiple Choice): Which of the following is a non-parametric test commonly used to compare two independent samples?
A) Student’s t-test
B) ANOVA
C) Mann-Whitney U test
D) Linear regression

A

C
  • The Mann-Whitney U test compares two independent samples without assuming normality.
  • Student’s t-test and ANOVA are parametric tests; linear regression is a modeling method, not a two-sample test.

Related fact: In a Poisson distribution, the mean and variance are equal.
  • Poisson models count data (e.g., number of arrivals, clicks, events in a fixed interval).
  • If λ is the average rate: Mean = λ and Variance = λ.
  • If the observed variance is much larger than the mean, this indicates overdispersion, and Poisson may not be suitable.
52
Q

Question (Conceptual): What is the difference between parametric estimation and maximum likelihood estimation (MLE) in statistics?