What is the Central Limit Theorem (CLT) and why is it important?
The CLT states that the sampling distribution of the sample mean approximates a normal distribution as the sample size gets larger (usually n≥30), regardless of the population’s original distribution. It is crucial because it allows us to use normal probability models to conduct hypothesis tests and build confidence intervals on non-normal data.
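A quick simulation makes this concrete. The sketch below (using NumPy, with an arbitrarily chosen skewed exponential population) draws many samples and shows that the sample means cluster into a roughly normal distribution centered on the population mean, with spread close to σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: heavily right-skewed exponential distribution (mean ~2)
population = rng.exponential(scale=2.0, size=100_000)

# Draw 5,000 samples of size n=50 and record each sample mean
sample_means = rng.choice(population, size=(5000, 50)).mean(axis=1)

# The sampling distribution of the mean is approximately normal:
# centered near the population mean, with spread ~ sigma / sqrt(n)
print(population.mean(), sample_means.mean())
print(population.std() / np.sqrt(50), sample_means.std())
```

Plotting a histogram of `sample_means` would show the familiar bell curve even though the underlying population is strongly skewed.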
Explain Type I and Type II errors with an example.
Type I Error (False Positive): Rejecting the null hypothesis when it is actually true (e.g., a spam filter flagging a legitimate email as spam); α represents the probability of a Type I error. Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false (e.g., a spam filter letting a malicious email through to your inbox); β represents the probability of a Type II error.
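The Type I error rate can be verified by simulation. In the sketch below (illustrative numbers), the null hypothesis is true by construction, so every rejection is a false positive, and the observed rejection rate lands near α:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# H0 is TRUE here: both groups come from the same N(0, 1) population,
# so every rejection is a Type I error (false positive)
alpha, rejections, trials = 0.05, 0, 2000
for _ in range(trials):
    a = rng.normal(0, 1, 40)
    b = rng.normal(0, 1, 40)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1

# The observed false-positive rate hovers around alpha = 0.05
print(rejections / trials)
```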
Explain Bayes’ Theorem and its components.
Bayes’ Theorem updates our belief in a hypothesis based on new evidence. The formula is $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, where $P(A \mid B)$ is the posterior (our updated belief in hypothesis A after seeing evidence B), $P(B \mid A)$ is the likelihood (how probable the evidence is if A holds), $P(A)$ is the prior belief in A, and $P(B)$ is the marginal probability of the evidence.
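A worked example with hypothetical numbers (1% disease prevalence, 95% test sensitivity, 5% false-positive rate) shows how the components combine, and why a positive result on a rare condition is less conclusive than intuition suggests:

```python
# Hypothetical numbers for illustration
p_disease = 0.01                # prior P(A): 1% prevalence
p_pos_given_disease = 0.95      # likelihood P(B|A): sensitivity
p_pos_given_healthy = 0.05      # false-positive rate

# Marginal P(B): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): probability of disease given a positive test
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.161
```

Despite the test being "95% accurate," the posterior probability of disease is only about 16%, because the prior is so low.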
What is the difference between Covariance and Correlation?
Both measure the linear relationship between two variables. Covariance indicates the direction of the relationship but is not standardized, meaning it can take any value from −∞ to ∞. Correlation is the standardized version of covariance, scaled strictly between −1 and 1, indicating both the direction and the strength of the relationship.
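The scaling difference is easy to demonstrate: rescaling one variable changes the covariance proportionally but leaves the correlation untouched. A minimal NumPy sketch (with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 3 * x + rng.normal(size=1000)   # positively related to x

cov = np.cov(x, y)[0, 1]            # unstandardized: depends on units
corr = np.corrcoef(x, y)[0, 1]      # standardized: always in [-1, 1]

# Multiplying y by 100 scales the covariance by 100,
# but the correlation is unchanged
cov_scaled = np.cov(x, 100 * y)[0, 1]
corr_scaled = np.corrcoef(x, 100 * y)[0, 1]
print(cov, corr)
print(cov_scaled, corr_scaled)
```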
What does a “95% Confidence Interval” actually mean?
It means that if we were to draw 100 different random samples from the population and compute a confidence interval for each, we expect approximately 95 of those intervals to contain the true population parameter. It does not mean there is a 95% probability that the true parameter lies within one specific, already calculated interval.
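The repeated-sampling interpretation can be checked directly by simulation. The sketch below builds 2,000 intervals (using the known-σ formula for simplicity) and counts how many cover the true mean:

```python
import numpy as np

rng = np.random.default_rng(3)
true_mu, sigma, n, sims = 10.0, 2.0, 40, 2000
z = 1.96  # critical value for a 95% interval

covered = 0
for _ in range(sims):
    sample = rng.normal(true_mu, sigma, n)
    half_width = z * sigma / np.sqrt(n)   # known-sigma interval
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    if lo <= true_mu <= hi:
        covered += 1

# Roughly 95% of the 2,000 intervals contain the true mean
print(covered / sims)
```

Each individual interval either contains the true mean or it does not; the "95%" describes the procedure's long-run success rate, not any single interval.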
When would you use a Poisson distribution vs. a Binomial distribution?
Use Binomial for a fixed number of discrete trials with two possible outcomes (e.g., number of clicks on an ad shown to 1,000 users). Use Poisson for counting the number of events occurring in a continuous, fixed interval of time or space, where events happen independently at a constant rate (e.g., number of customers arriving at a store per hour).
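The two distributions are also closely connected: for large n and small p, Binomial(n, p) is well approximated by Poisson(λ = np). A short NumPy sketch using the examples above (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(4)

# Binomial: 1,000 ad impressions, each clicked with probability 0.005
clicks = rng.binomial(n=1000, p=0.005, size=100_000)

# Poisson: arrivals at a constant rate of 5 per hour
arrivals = rng.poisson(lam=5.0, size=100_000)

# With large n and small p, Binomial(n, p) closely matches
# Poisson(lam = n * p) -- here both have mean 5
print(clicks.mean(), arrivals.mean())
```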
Explain Maximum Likelihood Estimation (MLE) in simple terms.
MLE is a method for estimating the parameters of a statistical model. It finds the specific parameter values that make the observed data most probable to have occurred. It essentially asks: “Given the data we just saw, which model parameters are most likely to have produced it?”
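A minimal sketch for a coin with an unknown bias: evaluate the log-likelihood of the observed flips over a grid of candidate probabilities and pick the maximizer. For this model the MLE is exactly the sample proportion of heads, which the grid search recovers:

```python
import numpy as np

rng = np.random.default_rng(5)
# Observed data: 100 flips of a coin with unknown P(heads)
flips = rng.binomial(1, 0.7, size=100)

# Log-likelihood of the data under each candidate value of p
candidates = np.linspace(0.01, 0.99, 99)
heads = flips.sum()
tails = len(flips) - heads
log_lik = heads * np.log(candidates) + tails * np.log(1 - candidates)

# The MLE is the candidate that makes the observed data most probable;
# for a coin this is exactly the sample proportion of heads
p_hat = candidates[np.argmax(log_lik)]
print(p_hat, flips.mean())
```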
What are the four main assumptions of Linear Regression?
The four main assumptions are: (1) Linearity: the relationship between the predictors and the target is linear. (2) Independence: the residuals (errors) are independent of one another. (3) Homoscedasticity: the residuals have constant variance across all levels of the predictors. (4) Normality: the residuals are approximately normally distributed.
What is A/B Testing, and how do you determine if the results are significant?
A/B testing is a randomized experiment comparing two versions of a variable (A and B) to determine which performs better. You determine statistical significance by calculating a p-value using a statistical test (like a t-test or Z-test). If the p-value is lower than your predetermined significance level (α, usually 0.05), you reject the null hypothesis and conclude the difference is statistically significant.
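For conversion-rate experiments, a common choice is the two-proportion Z-test. A sketch with hypothetical counts (5,000 visitors per variant, 4.0% vs 5.2% conversion), computing the pooled standard error and a two-sided p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical results: conversions out of visitors for A and B
conv_a, n_a = 200, 5000   # 4.0% conversion
conv_b, n_b = 260, 5000   # 5.2% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Two-proportion Z-test under H0: p_a == p_b
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))   # two-sided

print(z, p_value)
print("significant at alpha=0.05:", p_value < 0.05)
```

Note that α should be fixed before the experiment runs, not chosen after seeing the p-value.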
What is statistical power?
Statistical power is 1−β (where β is the probability of a Type II error). It is the probability that a statistical test will correctly reject a false null hypothesis. Simply put, it’s the test’s ability to detect a true effect or difference when one actually exists.
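Power can be estimated by simulating under a true effect. Here the null hypothesis is false by construction (means differ by 0.5σ), so the rejection rate estimates power; with 64 observations per group this classic setup gives roughly 80% power at α = 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# H0 is FALSE here: the true means differ by 0.5 standard deviations
alpha, trials, n = 0.05, 2000, 64
rejections = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(0.5, 1.0, n)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1

# Power = P(reject H0 | H0 false); roughly 0.8 in this setup
print(rejections / trials)
```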
What is the difference between the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT)?
The LLN states that as your sample size grows, the sample mean will get closer to the true population mean. The CLT states that as the sample size grows, the distribution of those sample means will look like a normal distribution (bell curve), regardless of the original data’s shape.
What is Simpson’s Paradox? Provide an example.
A phenomenon where a clear trend appears in separate groups of data but disappears or reverses when the groups are combined. Example: A hospital looks like it has a higher overall mortality rate than a clinic, but when broken down by “severe” vs “mild” cases, the hospital actually has a lower mortality rate for both groups. It highlights the danger of hidden confounding variables.
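The hospital example can be reproduced with a few made-up counts. Within each severity group the hospital has the lower mortality rate, yet its combined rate is higher, because it treats far more severe cases:

```python
# Hypothetical counts: (deaths, patients), split by case severity
hospital = {"mild": (3, 100), "severe": (30, 200)}
clinic = {"mild": (10, 300), "severe": (5, 20)}

def rate(deaths, patients):
    return deaths / patients

# Within EACH severity group, the hospital has the LOWER rate:
# mild: 3.0% vs 3.3%; severe: 15% vs 25%
for group in ("mild", "severe"):
    print(group, rate(*hospital[group]), rate(*clinic[group]))

# ...yet combined, the hospital looks WORSE (11.0% vs ~4.7%),
# because severity is a hidden confounder
print(rate(3 + 30, 100 + 200), rate(10 + 5, 300 + 20))
```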
Explain the Bias-Variance Tradeoff.
Bias is the error from overly simplistic assumptions (leads to underfitting). Variance is the error from being too sensitive to small fluctuations in the training data (leads to overfitting). The tradeoff is the balancing act of finding a model complex enough to capture patterns, but simple enough to generalize to new data.
What is the difference between R² and Adjusted R²?
R² measures the proportion of variance in the target variable explained by the model. However, R² never decreases when you add more features, even useless ones. Adjusted R² penalizes the model for adding useless predictors, and will only increase if the new feature improves the model more than expected by random chance.
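This can be demonstrated by fitting a regression with and without a pure-noise feature. The sketch below (synthetic data, ordinary least squares via NumPy) computes both metrics from their definitions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)   # y truly depends on x only
junk = rng.normal(size=n)        # pure-noise feature

def r2_and_adjusted(features, y):
    X = np.column_stack([np.ones(len(y)), features])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = ((y - X @ beta) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    p = X.shape[1] - 1            # number of predictors
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

r2_1, adj_1 = r2_and_adjusted(x.reshape(-1, 1), y)
r2_2, adj_2 = r2_and_adjusted(np.column_stack([x, junk]), y)

# R^2 never decreases when the useless feature is added;
# adjusted R^2 applies a penalty and can go down
print(r2_1, r2_2)
print(adj_1, adj_2)
```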
What are “Degrees of Freedom” in simple terms?
The number of independent pieces of information that went into calculating an estimate. Simply put, it’s the number of values in your data that are “free to vary” after certain constraints are applied (often calculated as n−1).
When should you use non-parametric tests instead of parametric tests?
Use parametric tests (like t-tests or ANOVA) when your data follows a specific distribution (usually normal) and is continuous. Use non-parametric tests (like Mann-Whitney U or Kruskal-Wallis) when your data is heavily skewed, has massive outliers, or consists of ordinal (ranked) categories.
What is ANOVA (Analysis of Variance) used for?
ANOVA is used to test if there is a statistically significant difference between the means of three or more independent groups. It does this by comparing the variance between the different groups to the variance within each individual group.
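A one-way ANOVA on three synthetic groups, where one group genuinely has a higher mean, can be run with SciPy's `f_oneway`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Three independent groups; group C has a genuinely higher mean
group_a = rng.normal(50, 5, 40)
group_b = rng.normal(50, 5, 40)
group_c = rng.normal(55, 5, 40)

# One-way ANOVA: is at least one group mean different?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
print("significant at alpha=0.05:", p_value < 0.05)
```

A significant result says only that at least one mean differs; a post-hoc test (e.g., Tukey's HSD) is needed to identify which groups differ.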
How do outliers affect the Mean, Median, and Mode?
The Mean is highly sensitive to outliers and gets pulled heavily in their direction. The Median is robust and largely unaffected, as it only looks at the middle sorted value. The Mode is generally unaffected by extreme numeric outliers.
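A tiny numeric example makes the contrast obvious: appending one extreme value drags the mean far upward, while the median and mode stay put:

```python
import numpy as np
from statistics import mode

data = np.array([10, 11, 12, 12, 13, 14])
with_outlier = np.append(data, 1000)

# Without the outlier: mean 12.0, median 12.0, mode 12
print(data.mean(), np.median(data), mode(data.tolist()))

# With the outlier: the mean jumps above 150, but the
# median and mode remain 12
print(with_outlier.mean(), np.median(with_outlier), mode(with_outlier.tolist()))
```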
What is Survivorship Bias?
A logical error where you only analyze the people or things that “survived” a process, ignoring those that failed because they are no longer visible. For example, looking only at successful startups to find the “secret to success,” while ignoring that 90% of failed startups did the exact same things.
What is the difference between a Z-test and a T-test?
Both test hypotheses about population means. You use a Z-test when the population variance is known and the sample size is large (n≥30). You use a T-test when the population variance is unknown or the sample size is small (n<30), because it relies on the “Student’s t-distribution,” which has fatter tails to account for the extra uncertainty.