Experimental Design & A/B Testing Flashcards

(41 cards)

1
Q

What is the main goal of an A/B test?

A

To compare outcomes between a control (A) and a treatment (B) group in a randomized experiment to estimate the causal effect of a change.

2
Q

What is randomization in A/B testing?

A

Assigning units (such as users or sessions) to treatment or control at random so that, on average, the groups are comparable on all other factors.

3
Q

Why is randomization critical for causal interpretation?

A

It breaks systematic links between treatment assignment and confounders, so differences in outcomes can be attributed to the treatment under reasonable assumptions.

4
Q

What are common units of randomization in online experiments?

A

Users, sessions, or requests, chosen so that interference between units is minimized and exposure is clearly defined.

5
Q

What is a treatment effect at a high level?

A

The difference in expected outcomes between the treatment and control conditions for the same population.

6
Q

What is uplift or lift in A/B tests?

A

The relative change in a metric between treatment and control, often expressed as a percentage increase or decrease.

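A minimal sketch of the lift computation (an illustrative helper function, not from any particular library):

```python
def lift(control_rate: float, treatment_rate: float) -> float:
    """Relative lift of treatment over control, as a fraction."""
    return (treatment_rate - control_rate) / control_rate

# Example: conversion rising from 10% to 11% is a 10% relative lift
# (even though the absolute difference is only 1 percentage point).
print(f"{lift(0.10, 0.11):.1%}")
```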
7
Q

What is a primary metric in experimental design?

A

The main outcome of interest that the experiment is aimed at improving, such as conversion rate or retention.

8
Q

What are guardrail metrics?

A

Metrics monitored to ensure that the treatment does not harm important aspects of the product, such as latency or error rates.

9
Q

What is a null hypothesis H₀ in the context of an A/B test?

A

The assumption that there is no difference in the metric between treatment and control (effect size zero).

10
Q

What is an alternative hypothesis H₁ in an A/B test?

A

The claim that there is a nonzero difference (positive or negative) between treatment and control.

11
Q

What is a significance level α in experiment design?

A

A pre-specified maximum probability of rejecting a true null hypothesis (Type I error), often chosen as 0.05.

12
Q

What is statistical power in an experiment?

A

The probability of correctly rejecting the null hypothesis when the treatment truly has the minimum effect size of interest.

13
Q

Why is power important when planning experiments?

A

Low power means a high chance of missing real effects, leading to wasted experimentation and misleading ‘no effect’ conclusions.

14
Q

What factors influence the required sample size for a given power?

A

Baseline metric level, minimum detectable effect size, variance, desired power, and chosen significance level.

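The interplay of these factors can be sketched with the standard Normal-approximation sample-size formula for two proportions (an illustrative helper; dedicated planning tools differ in details):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided two-proportion
    z-test: p_base is the baseline rate, mde_abs the minimum detectable
    absolute difference. Normal-approximation formula."""
    p_alt = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power quantile
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil((z_alpha + z_beta) ** 2 * var / mde_abs ** 2)

# Halving the detectable effect roughly quadruples the required sample.
print(sample_size_per_arm(0.10, 0.02))
print(sample_size_per_arm(0.10, 0.01))
```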
15
Q

Why does a smaller minimum detectable effect require a larger sample size?

A

Detecting subtler differences requires more data to distinguish them from random noise in metrics.

16
Q

What is a two-sample test for proportions in an A/B test?

A

A statistical test that compares conversion or success rates between treatment and control using binomial or Normal approximations.

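A pooled two-proportion z-test can be sketched in a few lines using only the standard library (illustrative; the hypothetical counts below assume a 10% vs 11.2% conversion comparison):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided pooled z-test for a difference in success rates.
    x_*: successes, n_*: trials. Returns (z, p_value) via the Normal
    approximation to the binomial."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)            # rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided tail
    return z, p_value

z, p = two_proportion_z_test(500, 5000, 560, 5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```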
17
Q

Why are confidence intervals often more informative than just p-values in A/B tests?

A

They show a range of plausible effect sizes and convey both magnitude and uncertainty, not just significance yes/no.
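As an illustrative sketch, a Wald-style confidence interval for the difference in rates (same hypothetical counts as a 10% vs 11.2% comparison; unpooled standard error):

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(x_a, n_a, x_b, n_b, confidence=0.95):
    """Wald confidence interval for p_b - p_a (unpooled standard error)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# The interval reports magnitude and uncertainty, not just a yes/no verdict.
lo, hi = diff_ci(500, 5000, 560, 5000)
print(f"95% CI for the absolute lift: [{lo:.4f}, {hi:.4f}]")
```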

18
Q

What is a fixed-horizon (classical) A/B test?

A

An experiment where the sample size and analysis time are set in advance and results are evaluated only after data collection is complete.

19
Q

Why does repeatedly checking p-values before the planned sample size inflate Type I error?

A

Each additional look at the data effectively adds another hypothesis test, increasing the chance of false positives unless corrected.
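A small simulation of A/A tests (no true effect) makes the inflation concrete: stopping at the first "significant" interim look rejects far more often than the nominal 5% (illustrative sketch; exact counts depend on the seed):

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(peeks, n_per_peek, trials=2000, alpha=0.05, seed=0):
    """Simulate A/A tests on a standard Normal metric, peeking after every
    n_per_peek observations and stopping at the first 'significant' look;
    returns the realized Type I error rate."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(peeks):
            for _ in range(n_per_peek):
                total += rng.gauss(0, 1)
                n += 1
            # z-statistic for the running mean against 0 at this interim look
            if abs(total / sqrt(n)) > z_crit:
                rejections += 1
                break
    return rejections / trials

# With 10 interim looks, the realized error rate well exceeds the nominal 5%.
print(false_positive_rate(peeks=10, n_per_peek=100))
```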

20
Q

What is sequential or online testing?

A

Approaches that allow monitoring results over time with statistical corrections to maintain error control while peeking at data.

21
Q

What is a simple good practice for stopping rules if you use fixed-horizon tests?

A

Commit to a sample size and analysis plan in advance and avoid acting on interim p-values unless using a sequential method.

22
Q

What is a one-sided vs two-sided test in A/B experiments?

A

A one-sided test evaluates improvement in a specific direction; a two-sided test looks for any difference, positive or negative.

23
Q

Why are two-sided tests often preferable in product experiments?

A

They guard against harmful changes by detecting significant decreases as well as increases in the metric.

24
Q

What is a pre-analysis plan?

A

A document or specification that defines hypotheses, metrics, sample size, and analysis methods before running the experiment.

25
Q

How does a pre-analysis plan help avoid p-hacking?

A

It commits the team to specific analyses, reducing temptation to search for significance through many unplanned tests.

26
Q

What is multiple testing in the context of experiments?

A

Running many experiments or testing many metrics/hypotheses, which increases the chance of at least one false positive.

27
Q

What are simple strategies to mitigate multiple testing issues?

A

Limiting the number of primary metrics, adjusting p-values (e.g., Bonferroni), or emphasizing effect sizes and replication over single p-values.

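A Bonferroni adjustment is a one-line rule (illustrative helper: each p-value is compared against alpha divided by the number of tests):

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses are rejected after Bonferroni correction:
    each p-value is compared against alpha / m for m tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With 5 metrics, only p-values below 0.01 survive the correction.
print(bonferroni([0.003, 0.02, 0.04, 0.20, 0.60]))
# -> [True, False, False, False, False]
```
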
28
Q

Why is it important to distinguish statistical significance from practical significance?

A

A tiny but statistically significant effect may be irrelevant to the business, while a practically important effect may be nonsignificant if the study is underpowered.

29
Q

What is sample ratio mismatch (SRM) in A/B tests?

A

A situation where the observed fraction of users in each group deviates significantly from the planned allocation ratio, often indicating randomization or logging issues.

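An SRM check is a chi-square goodness-of-fit test on the observed split (illustrative sketch; for 1 degree of freedom the p-value can be computed from the Normal tail, so only the standard library is needed):

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test against the planned split.
    For 1 degree of freedom, p = 2 * (1 - Phi(sqrt(chi2)))."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

# A 50.4% / 49.6% split of one million users is already a strong SRM signal.
print(f"{srm_check(504_000, 496_000):.2e}")
```
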
30
Q

Why is detecting SRM important?

A

Because it can signal bugs in assignment, tracking, or eligibility logic that invalidate experiment results.

31
Q

What is interference between units in experiments?

A

When the treatment of one unit (e.g., user) affects outcomes of another, violating independence assumptions.

32
Q

How can interference complicate A/B test interpretation?

A

It breaks the simple comparison of group averages as causal effects, since spillovers can bias differences in either direction.

33
Q

What is crossover or contamination in A/B tests?

A

When units switch between treatment and control conditions or are exposed to both, blurring group distinctions.

34
Q

Why should exposure and assignment logs be carefully instrumented?

A

Accurate logs are essential for understanding who saw what when, diagnosing SRM, and defining correct analysis populations.

35
Q

What is an intention-to-treat (ITT) analysis?

A

An analysis that compares groups based on initial random assignment regardless of actual treatment uptake, preserving randomization.

36
Q

Why is ITT often preferred over per-protocol analysis?

A

It avoids selection bias introduced by post-assignment behavior and maintains the benefits of randomization.

37
Q

What is a holdout group in experimentation?

A

A group of users withheld from any new changes, used as a long-term baseline for monitoring overall system trends.

38
Q

Why can long-lived holdouts be valuable?

A

They help distinguish experiment-driven changes from secular trends or seasonal effects in metrics.

39
Q

In the context of ML, what is offline evaluation vs online A/B testing?

A

Offline evaluation uses historical data and metrics on test sets; online A/B testing measures real-world impact on live traffic and business outcomes.

40
Q

Why can an ML model that looks better offline fail in an A/B test?

A

Offline metrics may not capture feedback effects, selection changes, latency, or user behavior shifts that appear in production.

41
Q

In one sentence, what is the key mental model for experimental design and A/B testing?

A

Plan experiments with clear hypotheses, metrics, and sample sizes, randomize carefully, monitor for pathologies like SRM, and interpret results by combining statistical evidence with practical impact.