What is the basic goal of hypothesis testing?
To assess whether observed data are compatible with a specified null hypothesis about a population or model parameter.
What is a null hypothesis H₀?
A default assumption or claim about a parameter or distribution, such as ‘no effect’ or ‘no difference’ between groups.
What is an alternative hypothesis H₁?
A competing claim that would be supported if the evidence is strong enough against the null, such as ‘there is a difference’.
What is a test statistic in hypothesis testing?
A function of the data that measures the degree of discrepancy between the observed data and what would be expected under H₀.
What is a p-value?
The probability, under the null hypothesis, of observing a test statistic at least as extreme as the one computed from the data.
Does a small p-value prove that the null hypothesis is false?
No; it indicates that the observed data would be unusual under H₀, but does not provide a direct probability that H₀ is true or false.
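The p-value definition above can be made concrete with a small sketch. This assumes a one-sample, two-sided z-test with a known standard deviation; all the numbers are illustrative, not from the text.

```python
import math

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0 (known sigma)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) under the standard normal null distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Illustrative data: mean 10.4 over 100 observations, H0 mean 10.0.
p = z_test_p_value(sample_mean=10.4, mu0=10.0, sigma=2.0, n=100)
print(round(p, 4))
```

The returned value answers exactly the question in the definition: how probable is a test statistic at least this extreme if H₀ were true.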
What is a significance level α (alpha)?
A chosen threshold such as 0.05, representing the maximum tolerable probability of rejecting H₀ when it is actually true (Type I error).
What is a Type I error?
Rejecting a true null hypothesis (a false positive).
What is a Type II error?
Failing to reject a false null hypothesis (a false negative).
What is statistical power?
The probability of correctly rejecting H₀ when it is false; i.e., 1 − β, where β is the Type II error rate.
Why is power important in experimental design?
Low power means that even real effects are unlikely to be detected, leading to wasted experiments and misleading ‘no effect’ conclusions.
What factors increase the power of a test?
Larger sample size, larger true effect size, lower outcome variance, and a higher significance level α (all else equal; raising α buys power at the cost of more Type I errors).
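The power factors above can be checked by simulation. This is a hedged Monte Carlo sketch: simulate many experiments with a real effect and count how often a two-sided z-test rejects H₀ at α = 0.05. The effect size, variance, and sample sizes are illustrative assumptions.

```python
import math
import random

def estimated_power(effect, sigma, n, trials=2000, seed=7):
    """Monte Carlo estimate of power for a one-sample z-test, alpha = 0.05."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided critical value for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # Simulate data where the true mean really is `effect`, not 0.
        mean = sum(rng.gauss(effect, sigma) for _ in range(n)) / n
        z = mean / (sigma / math.sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials

# Power grows with sample size, all else equal.
print(estimated_power(effect=0.3, sigma=1.0, n=20))
print(estimated_power(effect=0.3, sigma=1.0, n=100))
```

Rerunning with a larger `effect` or smaller `sigma` shows the other two levers from the answer above.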
Why is ‘p<0.05’ not a magic threshold?
It is a convention; real decisions should consider effect size, uncertainty, costs of errors, and context, not just a binary cutoff.
What is multiple testing or multiple comparisons?
Running many hypothesis tests, which increases the chance of getting at least one small p-value by random chance even when all nulls are true.
Why is multiple testing a concern in ML and analytics?
Trying many models, features, or metrics and only reporting the best results can lead to overoptimistic conclusions and p-hacking.
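The multiple-testing inflation described above follows directly from independence: with m independent tests of true nulls at level α, the family-wise error rate is 1 − (1 − α)^m. A minimal sketch, with m = 20 chosen for illustration:

```python
m = 20       # number of independent tests, all nulls true
alpha = 0.05

# Probability of at least one false positive across all m tests.
familywise = 1 - (1 - alpha) ** m
print(round(familywise, 3))

# Bonferroni correction: test each hypothesis at alpha / m instead,
# which brings the family-wise rate back under alpha.
bonferroni_familywise = 1 - (1 - alpha / m) ** m
print(round(bonferroni_familywise, 3))
```

With 20 tests, the uncorrected chance of at least one spurious "significant" result is roughly 64%, which is why reporting only the best of many results is so misleading.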
What is an A/B test at a high level?
A controlled experiment comparing outcomes between a control group (A) and a treatment group (B) to assess the effect of a change.
What is randomization in A/B testing?
Assigning units to groups at random so that, on average, groups are comparable and confounders are balanced.
Why is randomization critical for causal interpretation?
It breaks systematic links between treatment assignment and other variables, making differences in outcomes attributable to the treatment with fewer assumptions.
What are typical units of randomization in online A/B tests?
Users, sessions, or requests, depending on the product and outcome being measured.
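User-level randomization is often implemented with deterministic hashing, so each user sees a stable group across visits. This is a hedged sketch of one common pattern; the function and experiment names are illustrative, not a specific platform's API.

```python
import hashlib

def assign_group(user_id: str, experiment: str) -> str:
    """Stable pseudo-random 50/50 assignment based on a hash of the user id.

    Including the experiment name means different experiments get
    independent splits of the same user population.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"

print(assign_group("user-123", "new-checkout-flow"))
```

Because the assignment is a pure function of the id, the same user always lands in the same group, which matters when the unit of randomization is the user rather than the session or request.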
Why is the ‘stable unit treatment value assumption’ (SUTVA) important in experiments?
It assumes one unit’s outcome is not affected by another unit’s treatment, simplifying interpretation; interference can complicate A/B test analysis.
What is a lift in A/B testing?
The relative change in a metric between treatment and control, typically (treatment − control) / control, often expressed as a percentage increase or decrease.
What is the difference between statistical significance and practical significance?
Statistical significance is about whether an effect is unlikely under H₀; practical significance is whether the effect size is large enough to matter in practice.
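The lift and significance ideas above can be combined in one small analysis sketch: compute the relative lift and a two-sided, two-proportion z-test p-value for conversion rates. The counts are made-up illustrative data.

```python
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Lift and two-sided two-proportion z-test p-value (B vs control A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = (p_b - p_a) / p_a                    # relative change vs control
    pooled = (conv_a + conv_b) / (n_a + n_b)    # pooled rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return lift, p_value

lift, p = ab_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(f"lift = {lift:.1%}, p = {p:.4f}")
```

Note how the two outputs answer different questions: the p-value addresses statistical significance, while whether a lift of this size matters is a practical judgment.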
Why is it dangerous to stop an experiment as soon as p<0.05 without planning?
Repeated peeking at results inflates Type I error; you effectively perform multiple tests without correction.
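The peeking problem above is easy to demonstrate by simulation: under a true null, check a z-test after every batch of data and stop at the first p < 0.05. The false-positive rate ends up well above the nominal 5%. Batch size, number of peeks, and trial counts are illustrative.

```python
import math
import random

def peeking_false_positive_rate(peeks=10, batch=50, trials=2000, seed=3):
    """Fraction of null experiments declared 'significant' at some peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(peeks):
            # Data generated under H0: true mean is exactly zero.
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            z = (total / n) / (1.0 / math.sqrt(n))
            if abs(z) >= 1.96:        # 'significant' at this peek
                false_positives += 1
                break
    return false_positives / trials

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

Each peek is effectively an extra uncorrected test, which is exactly why fixed-horizon plans (next question) or sequential-testing corrections are needed.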
What is a pre-analysis plan or fixed-horizon test?
A plan that specifies sample size, metrics, and decision rules in advance and analyzes data only after reaching the planned sample size.