Probability and Statistics Refresher Flashcards

(191 cards)

1
Q

What does ‘probability’ mean? Contrast frequentist and Bayesian interpretations.

A

Frequentist: P(A) is the long-run relative frequency of event A in repeated identical trials.
Bayesian: P(A) is a degree of belief (a rational measure of uncertainty) given information.
Both use the same probability calculus (axioms + rules); they differ in interpretation and how parameters are treated (fixed vs random).

2
Q

Define sample space Ω, event A, and outcome ω. Give an example.

A

Sample space Ω: set of all possible outcomes ω of an experiment.
Event A: subset of Ω.
Example: roll a die. Ω={1,2,3,4,5,6}. Event A='even'={2,4,6}. Outcome ω might be 5.

3
Q

State the 3 Kolmogorov axioms of probability.

A

1) Non-negativity: for any event A, P(A) ≥ 0.
2) Normalization: P(Ω) = 1.
3) Additivity: for disjoint events A∩B=∅, P(A∪B)=P(A)+P(B). (Countable additivity in full generality.)

4
Q

Derive the complement rule: P(Aᶜ)=1−P(A).

A

Because A and Aᶜ are disjoint and A∪Aᶜ=Ω.
So 1=P(Ω)=P(A∪Aᶜ)=P(A)+P(Aᶜ) ⇒ P(Aᶜ)=1−P(A).

5
Q

Write P(A∪B) in terms of P(A), P(B), and P(A∩B).

A

Inclusion–exclusion for 2 events:
P(A∪B)=P(A)+P(B)−P(A∩B).
Special case: if A,B disjoint then P(A∩B)=0 and you recover additivity.

6
Q

State inclusion–exclusion for P(A∪B∪C).

A

P(A∪B∪C)=P(A)+P(B)+P(C)
−P(A∩B)−P(A∩C)−P(B∩C)
+P(A∩B∩C).

7
Q

Define conditional probability P(A|B). When is it defined?

A

If P(B)>0, then P(A|B) = P(A∩B)/P(B).
Interpretation: restrict the sample space to B and renormalize.

8
Q

Express P(A∩B) using P(A|B) and P(B) (and the symmetric form).

A

P(A∩B)=P(A|B)P(B)=P(B|A)P(A), assuming the conditioning event has positive probability.

9
Q

State the law of total probability for a partition {B_i}.

A

If {B_i} are disjoint, cover Ω, and P(B_i)>0, then for any event A:
P(A)=Σ_i P(A|B_i)P(B_i).
Think: break A into pieces inside each B_i.

10
Q

State Bayes’ theorem and interpret each term.

A

Bayes: P(A|B)= P(B|A)P(A) / P(B), with P(B)>0.
P(A)=prior, P(B|A)=likelihood, P(B)=evidence/normalizer, P(A|B)=posterior.
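The update is easy to check numerically. A short Python sketch (the base rate, sensitivity, and false-positive rate below are made-up illustration numbers):

```python
# Bayes' theorem on a hypothetical diagnostic test:
# prior P(disease)=0.01, sensitivity P(+|disease)=0.95,
# false-positive rate P(+|healthy)=0.05.
prior = 0.01
sens = 0.95   # likelihood P(B|A)
fpr = 0.05    # likelihood under the complement, P(B|Aᶜ)

# evidence P(+) via the law of total probability
evidence = sens * prior + fpr * (1 - prior)

# posterior P(disease | +)
posterior = sens * prior / evidence
print(round(posterior, 4))  # roughly 0.161
```

The low posterior despite an accurate-looking test is the classic base-rate effect: the prior pulls hard when the event is rare.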

11
Q

Write Bayes’ theorem in odds form for hypotheses H1 vs H0.

A

Posterior odds = prior odds × Bayes factor.
P(H1|D)/P(H0|D) = [P(D|H1)/P(D|H0)] × [P(H1)/P(H0)].
The Bayes factor measures evidence from data D.

12
Q

What is n! and when does it appear in counting?

A

n! = n·(n−1)·…·2·1 (with 0!=1).
Counts the number of ways to order n distinct objects (permutations of length n).

13
Q

How many ways to arrange r objects chosen from n distinct objects (no repetition)? Why does order matter?

A

Permutations: P(n,r)= n·(n−1)·…·(n−r+1)= n!/(n−r)!.
Order matters because different sequences correspond to different outcomes (e.g., gold/silver/bronze).

14
Q

How many ways to choose r objects from n distinct objects (no repetition) when order does not matter? Why not?

A

Combinations: C(n,r)= n choose r = n!/[r!(n−r)!].
Order doesn’t matter because selections are sets: {a,b}={b,a}.
You can derive it by dividing permutations by r! (all r! orders represent the same set).
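Python's math module exposes both counts directly; a quick check of the divide-by-r! derivation (n=10, r=3 chosen arbitrarily):

```python
import math

n, r = 10, 3

# permutations: ordered arrangements of r objects from n
perms = math.perm(n, r)   # 10*9*8 = 720

# combinations: unordered selections; divide out the r! orderings
combs = math.comb(n, r)   # 720 / 3! = 120

assert combs == perms // math.factorial(r)
print(perms, combs)  # 720 120
```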

15
Q

How many distinct permutations of n items with counts n1,n2,…,nk (sum=n)?

A

Multiset permutations: n!/(n1! n2! … nk!).
Reason: start with n! orders, but swapping identical items doesn’t create a new arrangement; divide by each group’s factorial.

16
Q

How many ways to choose r items from n types with repetition allowed (order irrelevant)?

A

Stars and bars: number of nonnegative integer solutions to x1+…+xn=r is C(n+r−1, r).
Interpret r stars and n−1 bars as separators between item types.
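The formula can be verified by brute force for small cases; a sketch (n=4 types and r=6 items are arbitrary choices):

```python
import math
from itertools import product

def count_solutions(n_types, r):
    """Brute-force count of nonnegative integer solutions x1+...+xn = r."""
    return sum(1 for xs in product(range(r + 1), repeat=n_types)
               if sum(xs) == r)

n, r = 4, 6
assert count_solutions(n, r) == math.comb(n + r - 1, r)
print(math.comb(n + r - 1, r))  # C(9, 6) = 84
```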

17
Q

State the binomial theorem and connect it to combinations.

A

(a+b)^n = Σ_{k=0}^n C(n,k) a^{n−k} b^k.
C(n,k) counts ways to choose which k of the n factors contribute a ‘b’ term (order of factors irrelevant).

18
Q

Use inclusion–exclusion to count |A∪B| and explain why subtraction is needed.

A

|A∪B|=|A|+|B|−|A∩B|.
Adding |A| and |B| double-counts elements in both sets; subtract once to correct.

19
Q

What is a random variable (RV)? Discrete vs continuous?

A

An RV X is a function from outcomes ω∈Ω to real numbers: X(ω)∈ℝ.
Discrete: takes countable values with PMF p(x)=P(X=x).
Continuous: takes values on intervals; probabilities come from integrals of a PDF f(x), with P(X=x)=0.

20
Q

Define the CDF F_X(x). List key properties.

A

F_X(x)=P(X≤x).
Properties: non-decreasing; right-continuous; limits: F(−∞)=0, F(∞)=1.
For any a<b: P(a< X ≤ b)=F(b)−F(a).

21
Q

Explain PMF vs PDF and how probabilities are computed in each case.

A

Discrete: PMF p(x)=P(X=x). For set S: P(X∈S)=Σ_{x∈S} p(x).
Continuous: PDF f(x)≥0 with ∫ f(x) dx =1 and P(a≤X≤b)=∫_a^b f(x) dx.
PDF is not a probability at a point; it is a density.

22
Q

How are PDF and CDF related for a continuous RV?

A

F(x)=∫_{−∞}^x f(t)dt.
If F is differentiable, f(x)=F′(x).
Probabilities come from area: P(a<X≤b)=F(b)−F(a)=∫_a^b f(t)dt.

23
Q

How does the CDF look for a discrete RV? What do the jumps mean?

A

A discrete CDF is a step function.
Jump size at x equals P(X=x).
Formally: P(X=x)=F(x)−lim_{t→x^-}F(t).

24
Q

Define survival function S(x) and hazard h(x). When are they used?

A

Survival: S(x)=P(X>x)=1−F(x).
Hazard (continuous): h(x)=f(x)/S(x).
Used in time-to-event / reliability / survival analysis; exponential has constant hazard.

25
Define E[X] for discrete and continuous X.
Discrete: E[X]=Σ_x x·p(x). Continuous: E[X]=∫_{−∞}^{∞} x·f(x) dx (if integral exists).
26
State LOTUS: how to compute E[g(X)] without finding distribution of g(X).
Discrete: E[g(X)]=Σ_x g(x)p(x). Continuous: E[g(X)]=∫ g(x)f(x)dx. You don't need the distribution of Y=g(X) to compute its expectation.
27
Define Var(X) and SD(X). Give the computational shortcut.
Var(X)=E[(X−E[X])^2]. SD(X)=√Var(X). Shortcut: Var(X)=E[X^2]−(E[X])^2.
28
State linearity of expectation. Does it require independence?
E[aX+bY+c]=aE[X]+bE[Y]+c. No independence is required. Linearity always holds when expectations exist.
29
Give Var(X+Y) in terms of Var and Cov. When does it simplify?
Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y). If X,Y independent (or just uncorrelated, Cov=0), then Var(X+Y)=Var(X)+Var(Y).
30
Define Cov(X,Y) and Corr(X,Y). What do they measure?
Cov(X,Y)=E[(X−E[X])(Y−E[Y])]=E[XY]−E[X]E[Y]. Correlation ρ=Cov(X,Y)/(σ_X σ_Y) in [−1,1]. They measure linear association (not general dependence).
31
Define independence for events and for random variables.
Events: A,B independent if P(A∩B)=P(A)P(B) (equiv. P(A|B)=P(A)). RVs X,Y independent if for all measurable sets A,B: P(X∈A, Y∈B)=P(X∈A)P(Y∈B). Equivalent: joint pdf/pmf factorizes: f_{X,Y}(x,y)=f_X(x)f_Y(y).
32
Is Cov(X,Y)=0 enough to conclude independence? When is it enough?
In general, no: uncorrelated does not imply independent. Exception: if (X,Y) are jointly normal (multivariate normal), then Cov(X,Y)=0 implies independence. Counterexample idea: let X~Uniform(−1,1), Y=X^2. Then Cov(X,Y)=0 but dependence is strong.
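The counterexample is easy to simulate (sample size and seed are arbitrary):

```python
import random

random.seed(0)
n = 100_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2 is a deterministic function of X

def cov(a, b):
    """Population-style sample covariance (divide by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

print(cov(xs, ys))                     # near 0: X and Y are uncorrelated
print(cov([abs(x) for x in xs], ys))   # clearly positive: they are dependent
```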
33
Give a practical checklist for computing a probability in problems.
1) Define the experiment and Ω clearly. 2) Define the event A precisely (often as a set or inequality). 3) Choose a tool: counting (equally likely), PMF/PDF/CDF, conditioning, complements, or total probability. 4) Compute carefully; sanity-check: result in [0,1], compare to bounds, consider extreme cases.
34
If all outcomes in Ω are equally likely, how do you compute P(A)?
P(A)=|A|/|Ω|. Key: 'equally likely' is an assumption about the model; justify it (symmetry, randomization mechanism).
35
When is it easier to compute P(Aᶜ) instead of P(A)? Give typical patterns.
Use complements when A is complicated but its complement is simple. Common patterns: 'at least one' vs 'none'; 'at most' vs 'more than'; unions of many overlapping events. Compute P(A)=1−P(Aᶜ).
36
When should you condition on an event or variable to compute a probability?
Condition when the problem has hidden structure: different cases, sequential stages, or dependence. Pick B that makes A easier under each case, then use P(A)=Σ P(A|B_i)P(B_i) (total probability).
37
What real-world 'story' does the Bernoulli distribution model?
Single trial success/failure (coin flip, click/no click).
38
What real-world 'story' does the Binomial distribution model?
# successes in n independent Bernoulli trials (A/B conversions in n users).
39
What real-world 'story' does the Geometric distribution model?
# trials until first success (how many customers until first purchase).
40
What real-world 'story' does the Negative Binomial distribution model?
# trials (or failures) until r successes (how many requests until r errors).
41
What real-world 'story' does the Hypergeometric distribution model?
# successes in n draws WITHOUT replacement (quality control sampling from a finite lot).
42
What real-world 'story' does the Poisson distribution model?
# events in time/space for rare independent events with constant rate (arrivals, defects).
43
What real-world 'story' does the Exponential distribution model?
Waiting time between Poisson events; memoryless lifetime model.
44
What real-world 'story' does the Gamma distribution model?
Waiting time until k-th Poisson event; sum of exponentials.
45
What real-world 'story' does the Normal distribution model?
Sums/averages of many small effects; measurement noise; CLT.
46
What real-world 'story' does the Lognormal distribution model?
Products of many positive factors; income/size distributions.
47
What real-world 'story' does the Uniform distribution model?
All values equally likely on an interval; naive noninformative model.
48
What real-world 'story' does the Beta distribution model?
Distribution over probabilities p in [0,1]; prior/posterior for Bernoulli/Binomial.
49
What real-world 'story' does the Dirichlet distribution model?
Distribution over probability vectors; prior/posterior for Multinomial.
50
Write Bernoulli(p) PMF, mean, and variance.
X~Bernoulli(p) takes values {0,1}. PMF: P(X=1)=p, P(X=0)=1−p. E[X]=p. Var(X)=p(1−p).
51
Given data x1,...,xn ∈ {0,1} from Bernoulli(p), what are MLE and MAP under Beta(α,β) prior?
Let S=Σ xi. Log-likelihood: ℓ(p)=S log p + (n−S) log(1−p). MLE: p̂ = S/n (sample mean). If prior p~Beta(α,β), posterior is Beta(α+S, β+n−S). MAP (α,β>1): p_MAP = (α+S−1)/(α+β+n−2).
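A worked sketch of these formulas (the 10 observations and the Beta(2,2) prior are made up):

```python
# Hypothetical data: 7 successes in 10 Bernoulli trials, Beta(2, 2) prior.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
n, S = len(data), sum(data)
alpha, beta = 2.0, 2.0

p_mle = S / n                                  # sample mean
post_a, post_b = alpha + S, beta + (n - S)     # posterior Beta(α+S, β+n−S)
p_map = (post_a - 1) / (post_a + post_b - 2)   # posterior mode (needs α, β > 1)

print(p_mle, p_map)  # 0.7 and 8/12 ≈ 0.667: the prior shrinks the estimate
```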
52
Write Binomial(n,p) PMF, mean, and variance.
X~Binomial(n,p) = # successes in n independent Bernoulli(p). PMF: P(X=k)=C(n,k) p^k (1−p)^{n−k}, k=0..n. E[X]=np. Var(X)=np(1−p).
53
For Binomial(n,p) with known n and observed k (or multiple samples), what are p MLE and Beta MAP?
Single observation k: likelihood ∝ p^k (1−p)^{n−k}. MLE: p̂ = k/n. With multiple Binomial samples (n_i,k_i): p̂ = (Σ k_i)/(Σ n_i). With Beta(α,β) prior: posterior Beta(α+Σk_i, β+Σ(n_i−k_i)). MAP: (α+Σk_i−1)/(α+β+Σn_i−2) (when α,β>1).
54
What are the two common parameterizations of the Geometric(p) distribution?
Convention A (trials until first success): support {1,2,...}, P(X=k)=(1−p)^{k−1}p. Convention B (failures before first success): support {0,1,2,...}, P(X=k)=(1−p)^k p. Always check which convention a source/problem uses.
55
Give mean/variance of Geometric(p) and state the memoryless property.
Using support {1,2,...}: E[X]=1/p, Var(X)=(1−p)/p^2. Memoryless: P(X>m+n | X>m)=P(X>n). (Same idea as exponential.)
56
Write the negative binomial PMF for number of trials until r successes.
Let X be the trial count when the r-th success occurs (support {r,r+1,...}). P(X=k)=C(k−1, r−1) p^r (1−p)^{k−r}. Reason: among first k−1 trials, choose positions of r−1 successes; k-th trial is success.
57
Write Hypergeometric(N, K, n) PMF and explain the setup.
Finite population size N with K successes (and N−K failures). Draw n items *without replacement*. X=# successes in sample. P(X=k)= [C(K,k) C(N−K, n−k)] / C(N,n). Support: max(0, n−(N−K)) ≤ k ≤ min(n,K).
58
Write Poisson(λ) PMF, mean, and variance.
X~Poisson(λ): P(X=k)=e^{−λ} λ^k/k!, k=0,1,2,... E[X]=λ. Var(X)=λ. Often models counts in a fixed interval when events occur with constant rate and independence assumptions.
59
Given i.i.d. data x1,...,xn ~ Poisson(λ), find MLE and MAP with Gamma(α,β) prior (rate β).
Let S=Σ xi. Log-likelihood: ℓ(λ)= S log λ − nλ + const. MLE: λ̂ = S/n (sample mean). Prior λ~Gamma(α,β) with pdf ∝ λ^{α−1} e^{−βλ}. Posterior: Gamma(α+S, β+n). MAP (α+S>1): (α+S−1)/(β+n).
60
Write Uniform(a,b) PDF, mean, and variance.
f(x)=1/(b−a) for a≤x≤b else 0. E[X]=(a+b)/2. Var(X)=(b−a)^2/12.
61
Write Exponential(λ) PDF and CDF, and state memorylessness.
Support x≥0. PDF: f(x)=λ e^{−λx}. CDF: F(x)=1−e^{−λx}. Memoryless: P(X>s+t | X>s)=P(X>t).
62
Given i.i.d. x1,...,xn ~ Exponential(λ), find MLE and MAP with Gamma(α,β) prior on λ.
Likelihood: L(λ)=λ^n exp(−λ Σx_i). MLE: λ̂ = n / Σx_i = 1 / x̄. With prior λ~Gamma(α,β) (rate β): posterior Gamma(α+n, β+Σx_i). MAP: (α+n−1)/(β+Σx_i) (requires α+n>1).
63
Define Gamma distribution (shape α, rate β). Give mean and variance.
Support x≥0. PDF: f(x)= β^α/Γ(α) · x^{α−1} e^{−βx}. Mean: α/β. Variance: α/β^2. Special cases: α=1 gives Exponential(β); the sum of α i.i.d. Exponential(β) variables (integer α) is Gamma(α,β).
64
Write Normal(μ,σ²) PDF, mean, and variance.
PDF: f(x)= (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)). E[X]=μ, Var(X)=σ².
65
Given i.i.d. x1,...,xn ~ Normal(μ,σ²), what are the MLEs of μ and σ²?
MLE μ̂ = x̄. MLE σ̂² = (1/n) Σ (x_i−x̄)². (Note: unbiased sample variance uses 1/(n−1), not MLE.)
66
Write Beta(α,β) PDF and its mean and variance.
Support p∈[0,1]. PDF: f(p)= (1/B(α,β)) p^{α−1}(1−p)^{β−1}. Mean: α/(α+β). Var: αβ/[(α+β)²(α+β+1)].
67
Show why Beta is conjugate to Bernoulli/Binomial.
If p~Beta(α,β) and data are Bernoulli/Binomial with S successes and F failures: Posterior p|data ~ Beta(α+S, β+F). Reason: likelihood contributes p^S(1−p)^F; multiplying by prior keeps the same functional form.
68
Write Multinomial(n, p1..pk) PMF and interpret parameters.
Counts (X1,...,Xk) with Σ Xi = n and probabilities p_i (Σ p_i=1). P(X1=x1,...,Xk=xk)= n!/(∏ x_i!) ∏ p_i^{x_i}. Models n independent draws into k categories.
69
State Dirichlet(α1..αk) and why it is conjugate to Multinomial.
Dirichlet density on p=(p1..pk): f(p) ∝ ∏ p_i^{α_i−1} on simplex. If p~Dir(α) and counts x from Multinomial, posterior is Dir(α_i + x_i). Interpret α_i as 'pseudo-counts'.
70
State the Poisson limit of the Binomial and the conditions.
If X~Binomial(n,p) with n large, p small, and λ=np fixed, then X ≈ Poisson(λ). Intuition: many trials, rare success, constant expected count.
71
When can Binomial(n,p) be approximated by Normal? Include continuity correction.
If np and n(1−p) are both 'large' (common rule: ≥5 or ≥10), then X ≈ Normal(np, np(1−p)). Continuity correction: P(X≤k) ≈ Φ((k+0.5−np)/√(np(1−p))).
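The benefit of the correction can be checked against the exact binomial CDF; a standard-library sketch (n=50, p=0.3, k=12 are arbitrary but satisfy the rule of thumb):

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, k = 50, 0.3, 12
mu, sd = n * p, math.sqrt(n * p * (1 - p))

exact = binom_cdf(k, n, p)
approx = normal_cdf((k + 0.5 - mu) / sd)   # with continuity correction (+0.5)
crude = normal_cdf((k - mu) / sd)          # without correction

print(exact, approx, crude)  # corrected value tracks the exact CDF more closely
```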
72
When can Poisson(λ) be approximated by Normal?
For large λ (rule of thumb: λ≥10 or 20), X~Poisson(λ) ≈ Normal(λ, λ). Use continuity correction if approximating discrete probabilities.
73
What distribution results from mixing a Poisson with a Gamma prior on its rate?
If X|λ ~ Poisson(λ) and λ ~ Gamma(α,β), then the marginal distribution of X is Negative Binomial. Interpretation: overdispersion—random rate λ inflates variance vs mean compared to pure Poisson.
74
Why does the Normal show up so often? Connect to sums and CLT.
If X_i are i.i.d. with finite mean μ and variance σ², then the standardized sum (ΣX_i − nμ)/(σ√n) converges in distribution to N(0,1) as n→∞ (CLT). So averages of many small effects become approximately normal.
75
Define a 'random sample' from a population. What makes sampling random?
A random sample usually means i.i.d. observations X1,...,Xn drawn according to the same population distribution. Randomness comes from a known randomization mechanism: each unit has a known chance of selection (e.g., simple random sampling) and measurements are not systematically biased. Key threats: selection bias, nonresponse, dependence (clustered sampling), measurement bias.
76
What is simple random sampling without replacement? What distribution does the count of successes follow?
SRS without replacement: choose n units uniformly from N without replacement. If population has K successes, the sample success count X follows Hypergeometric(N,K,n). If N is large relative to n, Hypergeometric ≈ Binomial(n, p=K/N).
77
If you sample with replacement from a finite population with success proportion p, what distribution does the # successes follow?
With replacement, draws are (approximately) independent. # successes in n draws follows Binomial(n,p) exactly (under the model).
78
What is the finite population correction and when is it used?
When sampling without replacement from a finite population, the variance of the sample mean is smaller. If sampling fraction f = n/N is not negligible (e.g., >5–10%), use FPC: √((N−n)/(N−1)). For estimating a mean: SE( x̄ ) ≈ (σ/√n)·√((N−n)/(N−1)).
79
State (informally) the law of large numbers and what it guarantees.
If X1,...,Xn are i.i.d. with mean μ, then the sample mean x̄ converges to μ as n grows. LLN is about convergence of averages (consistency), not about the distribution of x̄ for finite n.
80
State the CLT and what conditions matter most in practice.
For i.i.d. variables with finite mean μ and finite variance σ²: Z = (x̄ − μ)/(σ/√n) → N(0,1) in distribution as n→∞. In practice: independence (or weak dependence), not-too-heavy tails, and sufficient n. If variance is infinite (heavy tails), classical CLT may fail.
81
If Xi ~ Normal(μ,σ²) i.i.d., what is the distribution of the sample mean x̄?
Exactly: x̄ ~ Normal(μ, σ²/n). No approximation needed because linear combinations of normals are normal.
82
Define the one-sample t-statistic and state its distribution under normality.
If Xi ~ Normal(μ,σ²) and σ² unknown, define sample mean x̄ and sample SD s. t = (x̄ − μ)/(s/√n) follows Student t with df = n−1. This is why t-intervals/tests are exact for normal populations.
83
What is the distribution of (n−1)s²/σ² under normality?
If Xi ~ Normal(μ,σ²), then (n−1)s²/σ² ~ χ²_{n−1}. Used for confidence intervals and tests about σ².
84
How does the F distribution arise from chi-square variables?
If U~χ²_{d1} and V~χ²_{d2} are independent, then (U/d1)/(V/d2) ~ F_{d1,d2}. In normal samples, ratios of sample variances lead to F tests.
85
Define mean, variance, skewness, and kurtosis in terms of moments.
Mean μ=E[X]. Variance σ²=E[(X−μ)²]. Skewness γ1 = E[(X−μ)³]/σ³ (asymmetry). Excess kurtosis γ2 = E[(X−μ)⁴]/σ⁴ − 3 (tail heaviness vs normal).
86
Define the p-quantile and connect it to the CDF.
The p-quantile q_p satisfies F(q_p) ≥ p and F(q_p−) ≤ p. For continuous strictly increasing F: q_p = F^{-1}(p). Median is q_0.5.
87
Define the moment-generating function (MGF). What is it used for?
MGF: M_X(t)=E[e^{tX}] (when finite near t=0). Derivatives at 0 give moments: M'(0)=E[X], M''(0)=E[X²], etc. MGFs help prove distributional identities and sums of independent RVs: M_{X+Y}=M_X·M_Y.
88
What is the characteristic function and why is it more general than the MGF?
φ_X(t)=E[e^{itX}]. Always exists (bounded by 1), unlike MGF. Uniquely determines distribution; products correspond to sums of independent RVs. Used in advanced limit theorems.
89
Give the Bayesian updating recipe in words and symbols.
Start with prior p(θ). Data D has likelihood p(D|θ). Posterior: p(θ|D) ∝ p(D|θ)p(θ). Evidence: p(D)=∫ p(D|θ)p(θ)dθ. Posterior predictive: p(x_new|D)=∫ p(x_new|θ)p(θ|D)dθ.
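The recipe can be run on a grid when no closed form is handy; a sketch for the Beta–Bernoulli case (8 successes, 2 failures, and the flat prior are made-up choices):

```python
# Grid approximation of Bayesian updating for a Bernoulli parameter p.
S, F = 8, 2
grid = [i / 1000 for i in range(1, 1000)]        # p values in (0, 1)

prior = [1.0 for _ in grid]                      # flat prior ~ Beta(1, 1)
like = [p**S * (1 - p)**F for p in grid]         # likelihood p(D|p)
unnorm = [l * pr for l, pr in zip(like, prior)]  # posterior ∝ likelihood × prior
Z = sum(unnorm)                                  # discrete stand-in for the evidence
post = [u / Z for u in unnorm]

post_mean = sum(p * w for p, w in zip(grid, post))
print(round(post_mean, 3))  # close to the exact Beta(9, 3) mean, 9/12 = 0.75
```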
90
Explain the roles of prior, likelihood, and posterior in Bayesian inference.
Prior: beliefs about parameter θ before data. Likelihood: how plausible data are under θ. Posterior: updated beliefs after seeing data; balances prior and likelihood. As n grows, likelihood often dominates (under regularity), but prior still matters with small data or weak signal.
91
What is a conjugate prior and why is it useful?
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. Benefits: closed-form posterior updates (fast, interpretable), easy predictive calculations. Examples: Beta–Bernoulli/Binomial, Gamma–Poisson/Exponential, Normal–Normal (known variance), Dirichlet–Multinomial.
92
Contrast a Bayesian credible interval with a frequentist confidence interval.
Credible interval: P(θ ∈ [a,b] | data)=0.95 (probability about parameter given data). Confidence interval: procedure that yields intervals that contain the true fixed parameter θ in 95% of repeated samples. CI is NOT '95% probability θ is in this interval' under strict frequentist interpretation.
93
In frequentist vs Bayesian analysis, what is treated as random?
Frequentist: parameter θ is fixed (unknown); data are random due to sampling. Bayesian: parameter θ is random with a prior; data are observed; uncertainty about θ is modeled probabilistically.
94
Define the maximum likelihood estimator (MLE).
Given data D and parameter θ, the likelihood is L(θ)=p(D|θ). MLE: θ̂_MLE = argmax_θ L(θ) = argmax_θ log L(θ). Interpretation: choose parameter value that makes observed data most probable under the model.
95
Define the maximum a posteriori (MAP) estimator and connect it to MLE.
MAP: θ̂_MAP = argmax_θ p(θ|D). Using Bayes: p(θ|D) ∝ p(D|θ)p(θ). So MAP maximizes log-likelihood + log-prior. If prior is flat (constant), MAP=MLE.
96
Explain how L2 and L1 regularization correspond to MAP in linear regression.
Linear regression with Gaussian noise: maximizing likelihood is minimizing SSE. Add L2 penalty (ridge): minimize SSE + λ||β||² → equivalent to MAP with Gaussian prior β~N(0, τ²I). Add L1 penalty (lasso): minimize SSE + λ||β||₁ → equivalent to MAP with Laplace (double-exponential) prior on β. Regularization encodes prior beliefs about coefficient size/sparsity.
97
State the bias–variance decomposition for squared error and why it matters.
For an estimator f̂(x), expected squared error at x: E[(f̂(x)−f(x))²] = Bias(f̂(x))² + Var(f̂(x)) + noise. Regularization can increase bias but reduce variance, often lowering total error.
98
What is a sufficient statistic? Give an example.
T(X) is sufficient for θ if the data provide no extra information about θ beyond T. Factorization theorem: p(x|θ)=g(T(x),θ)h(x). Example: for Bernoulli/Binomial, S=Σx_i is sufficient for p.
99
Write the standard linear regression model and assumptions.
Model: y = Xβ + ε. Common assumptions for classical inference: ε ~ N(0, σ²I), independent, homoscedastic. Then β̂_OLS = argmin ||y−Xβ||². Normality is mainly needed for exact t/F inference; OLS still minimizes SSE without it.
100
What is the closed-form OLS estimator β̂ (when it exists)?
If X has full column rank, β̂ = (XᵀX)^{-1} Xᵀ y. If not full rank, use pseudoinverse / regularization.
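For a single predictor plus intercept, the normal equations reduce to a 2×2 solve; a sketch on made-up data:

```python
# OLS via the normal equations for y = b0 + b1*x (hypothetical points).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # roughly y = 1 + 2x plus noise

n = len(xs)
# entries of XᵀX and Xᵀy for design matrix X = [[1, x_i]]
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx          # nonzero iff the xs are not all equal
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

print(b0, b1)  # intercept near 1, slope near 2
```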
101
How do you interpret a coefficient β_j in multiple linear regression?
β_j is the expected change in y for a 1-unit increase in x_j, holding other predictors fixed. Interpretation depends on model validity, scaling, interactions, and collinearity.
102
How do t-tests arise in linear regression for a single coefficient?
Under normal errors, each coefficient estimate β̂_j has a sampling distribution. Test H0: β_j=0 uses t = (β̂_j−0)/SE(β̂_j), which follows t with df = n−p. SE depends on σ̂² and (XᵀX)^{-1}.
103
What does an F-test do in regression?
Compares explained vs unexplained variance or compares nested models. H0: restricted model is true (some coefficients =0). F = ((RSS_restricted − RSS_full)/q) / (RSS_full/(n−p_full)). Under normal errors, F ~ F_{q, n−p_full}.
104
What are the common conjugate priors for Bayesian linear regression?
With Gaussian likelihood (ε normal): Often use β|σ² ~ Normal(β0, σ² V0) and σ² ~ Inverse-Gamma(a0,b0) (or Normal–Inverse-Gamma jointly). Posterior is also Normal–Inverse-Gamma with closed-form updates. Predictive distribution is Student-t.
105
What is a 95% confidence interval (CI) in the frequentist sense?
A CI is a random interval procedure I(D) computed from data D. A 95% CI satisfies: P_θ( θ ∈ I(D) ) = 0.95 over repeated samples from the model. After observing data, the interval is fixed; θ is fixed (frequentist).
106
Give the generic CI template using a point estimate and a standard error.
CI ≈ estimate ± (critical value) × (standard error). Critical value depends on desired confidence level and sampling distribution (z, t, etc.). SE depends on variance and sample design. This template comes from a pivotal quantity or asymptotic normality.
107
Define margin of error. What affects it?
MOE = (critical value) × SE. Higher confidence → larger critical value → larger MOE. Larger n → smaller SE (typically ∝1/√n) → smaller MOE. More variability (σ or p(1−p)) → larger SE → larger MOE.
108
Write the CI for a population mean when σ is known (or n large with known σ).
If x̄ ~ Normal(μ, σ²/n) and σ known: CI: x̄ ± z_{α/2} · (σ/√n). For 95%: z_{0.025}≈1.96.
109
Write the CI for μ when σ is unknown and data are (approximately) normal.
CI: x̄ ± t_{α/2, df=n−1} · (s/√n). Uses sample SD s and Student-t critical value. Exact if population is normal; robust-ish with moderate n by CLT.
110
Give the common large-sample CI for a proportion p (Wald). When does it fail?
Let p̂ = X/n. Approx CI: p̂ ± z_{α/2} √(p̂(1−p̂)/n). Fails when n small or p near 0/1 (coverage poor). Better alternatives: Wilson score, Agresti–Coull, exact (Clopper–Pearson).
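The failure mode is easy to see numerically; a sketch comparing Wald and Wilson on 1 success in 20 trials (made-up counts):

```python
import math

def wald_ci(x, n, z=1.96):
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(x, n, z=1.96):
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Near the boundary, Wald dips below 0; Wilson stays inside (0, 1).
print(wald_ci(1, 20))
print(wilson_ci(1, 20))
```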
111
How do you choose n to achieve a target margin of error E for a mean (σ known)?
Want MOE = z_{α/2} σ/√n ≤ E. Solve: n ≥ (z_{α/2} σ / E)². If σ unknown, use a pilot estimate or conservative bound.
112
How do you choose n to achieve target MOE E for a proportion?
MOE ≈ z_{α/2} √(p(1−p)/n) ≤ E ⇒ n ≥ z_{α/2}² p(1−p)/E². If p unknown, worst-case p=0.5 maximizes p(1−p)=0.25 ⇒ n ≥ z²·0.25/E².
113
What happens to CI width when you change confidence level (e.g., 90%→99%)?
Higher confidence uses a larger critical value (z or t), so the interval becomes wider. Tradeoff: more confidence about containing θ, but less precision (wider range).
114
If you don't know the population standard deviation σ, what do you do for inference on μ?
Use sample SD s as an estimator. If population is normal (or n moderate/large), use t-based methods: t = (x̄−μ)/(s/√n). For non-normal small n, consider bootstrap CIs or nonparametric methods.
115
Under what conditions is assuming normality reasonable for inference?
1) If the data-generating mechanism is approximately additive noise from many small sources (physics/measurement). 2) If you're working with averages/sums and n is large (CLT) — but watch out for heavy tails/outliers. 3) If diagnostics support it (QQ plot, residual plots) and tests aren't overly sensitive. In regression, normality of errors affects p-values/interval exactness more than unbiasedness.
116
Define H0 and H1 (or Ha). What does it mean to 'reject H0'?
H0: baseline/default claim (often 'no effect', 'no difference'). H1: competing claim (effect/difference exists, or direction specified). Rejecting H0 means the data are sufficiently inconsistent with H0 under the chosen test at significance α. It does NOT prove H1; it indicates evidence against H0.
117
Define Type I error, Type II error, α, β, and power.
Type I: reject H0 when H0 is true. Probability = α (significance level). Type II: fail to reject H0 when H1 is true. Probability = β. Power = 1−β: probability of detecting a true effect of a specified size.
118
What is a p-value? What is a common misinterpretation?
p-value = P(test statistic at least as extreme as observed | H0 is true). Misinterpretation: 'probability H0 is true'. That is not what a p-value is. Small p-value indicates data would be unlikely under H0 (given model assumptions).
119
Give the standard workflow for a hypothesis test.
1) Specify H0 and H1. 2) Choose a test statistic T and its distribution under H0 (exact or asymptotic). 3) Choose α (or compute p-value). 4) Compute T from data. 5) Reject H0 if p≤α (or if T in rejection region). 6) Report effect size + CI, assumptions, and practical significance.
120
When do you use one-sided vs two-sided hypotheses?
Two-sided: H1 allows deviations in either direction (μ≠μ0). Default when direction isn't pre-specified. One-sided: H1 specifies direction (μ>μ0 or μ<μ0). Use only if direction is scientifically justified *before* seeing data. One-sided tests have more power in that direction but ignore the other.
121
What happens to the Type I error rate when you run many hypothesis tests?
If you test many hypotheses, the chance of at least one false positive increases. Family-wise error rate (FWER) can be controlled by Bonferroni or Holm. False discovery rate (FDR) can be controlled by Benjamini–Hochberg. In ML research, multiple metrics or hyperparameter sweeps can inflate false discoveries.
122
Set up the one-sample z-test for a mean with known σ.
H0: μ=μ0. Test statistic: z=(x̄−μ0)/(σ/√n). Under H0 (normal or CLT), z ~ N(0,1). Reject for |z|>z_{α/2} (two-sided) or appropriate one-sided cutoff.
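A minimal sketch (x̄, σ, μ0, and n below are made-up numbers):

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: n=36 measurements, x̄=10.4, testing H0: μ=10 with known σ=1.2
xbar, mu0, sigma, n = 10.4, 10.0, 1.2, 36

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_two_sided = 2 * (1 - normal_cdf(abs(z)))   # two-sided p-value

print(round(z, 2), round(p_two_sided, 4))  # z = 2.0, p ≈ 0.0455
```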
123
Set up the one-sample t-test for a mean with unknown σ.
H0: μ=μ0. Statistic: t=(x̄−μ0)/(s/√n). If population normal: t ~ t_{n−1} under H0. Otherwise, approximate for moderate/large n.
124
Describe the two-sample t-test for difference of means (independent samples).
Goal: test H0: μ1−μ2 = Δ0 (often 0). Use statistic based on (x̄1−x̄2−Δ0)/SE. If variances assumed equal: pooled t-test. If variances not assumed equal: Welch's t-test (default in practice).
125
Why is Welch's t-test often preferred over pooled t-test?
Welch does not assume equal variances; it uses SE = √(s1²/n1 + s2²/n2) and an approximate df (Welch–Satterthwaite). If variances are equal, Welch performs nearly as well; if not, it protects Type I error better. So it's a safer default unless you have strong evidence of equal variances.
126
When should you use a paired t-test instead of a two-sample test?
Use paired t-test when observations are naturally matched (before/after, twin studies, same subject under two conditions). Compute differences d_i = x_i − y_i and run a one-sample t-test on d_i. Pairing removes between-subject variability, increasing power.
127
Write the Welch CI for μ1−μ2.
CI: (x̄1−x̄2) ± t_{α/2, df} · √(s1²/n1 + s2²/n2), with df given by Welch–Satterthwaite approximation. Use when samples independent and variances may differ.
128
Write the CI for the mean paired difference.
Let d_i = x_i − y_i, with mean d̄ and SD s_d, n pairs. CI for μ_d: d̄ ± t_{α/2, n−1} · (s_d/√n). Then infer μ1−μ2 via μ_d.
129
Set up the one-sample z-test for a proportion p.
H0: p=p0. With X~Binomial(n,p0): z = (p̂−p0)/√(p0(1−p0)/n) ≈ N(0,1) for large n. Use if np0 and n(1−p0) are large; otherwise consider exact binomial test.
130
Set up the two-proportion z-test (A/B test) for p1=p2.
H0: p1=p2. Use pooled estimate p̂ = (x1+x2)/(n1+n2). z = (p̂1−p̂2)/√( p̂(1−p̂)(1/n1+1/n2) ). Large-sample approx; report CI for p1−p2 as well.
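A sketch with hypothetical conversion counts:

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical A/B test: 120/1000 conversions vs 150/1000
x1, n1, x2, n2 = 120, 1000, 150, 1000
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled estimate under H0: p1 = p2

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided

print(round(z, 3), round(p_value, 4))
```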
131
Describe the chi-square GOF test and its test statistic.
Tests whether observed counts match expected counts under a model. Statistic: χ² = Σ_i (O_i − E_i)² / E_i over categories i. Under H0 (large sample): χ² ~ χ²_{df}, where df = (#categories − 1 − #estimated parameters). Rule: expected counts E_i should be sufficiently large (often ≥5).
132
How do you test independence in an r×c contingency table?
H0: row and column variables are independent. Expected count in cell (i,j): E_{ij} = (row_i total)(col_j total)/grand total. Statistic: χ² = Σ_{i,j} (O_{ij}−E_{ij})²/E_{ij}. df=(r−1)(c−1). Large-sample approximation; use Fisher's exact for small counts (2×2).
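A quick sketch using scipy's built-in independence test on a hypothetical 2×3 table of counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts for a 2x3 contingency table (hypothetical data)
table = np.array([[30, 20, 10],
                  [20, 30, 40]])

# Returns the chi-square statistic, p-value, df, and expected counts E_ij
chi2, p, dof, expected = chi2_contingency(table)
# df = (r-1)(c-1) = (2-1)(3-1) = 2
```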
133
When is Fisher’s exact test preferred to chi-square in 2×2 tables?
When sample sizes are small or expected counts are low (common rule: any expected count <5). Fisher’s exact computes exact p-value under fixed margins using the hypergeometric distribution. It controls Type I error without large-sample approximations.
134
Set up an F-test for equality of two variances (normal samples).
H0: σ1²=σ2². Statistic: F = s1²/s2² (often put the larger variance on top). Under H0 with normality and independence: F ~ F_{n1−1, n2−1}. Sensitive to non-normality; Levene/Brown–Forsythe are more robust.
135
What problem does one-way ANOVA solve? What is the null hypothesis?
Compares means across k ≥ 2 groups (most useful for 3+, where running all pairwise t-tests would inflate Type I error). H0: μ1=μ2=...=μk (all group means equal). Uses an F statistic comparing between-group variance to within-group variance.
136
Write the ANOVA F statistic in terms of mean squares.
F = MS_between / MS_within. MS_between = SS_between/(k−1), MS_within = SS_within/(N−k). Under H0 with normal errors: F ~ F_{k−1, N−k}.
137
What is a permutation test and when is it useful?
A permutation test evaluates a null hypothesis by comparing the observed test statistic to its distribution under label shuffling. Useful when parametric assumptions (normality, equal variance) are questionable. Common in ML to test if model A truly outperforms model B using paired predictions or metrics.
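A minimal permutation test for a difference in means, assuming exchangeability under H0 (data and group shift are hypothetical):

```python
import numpy as np

def perm_test_mean_diff(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for H0: no difference in means."""
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # shuffle group labels
        diff = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)           # add-one keeps p-value valid

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(2.0, 1.0, 40)   # clearly shifted group
p = perm_test_mean_diff(x, y)
```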
138
What is the bootstrap and what does it estimate?
Bootstrap resamples the observed dataset with replacement to approximate the sampling distribution of an estimator. Used to estimate standard errors, bias, and confidence intervals (percentile, BCa, etc.). Works well when i.i.d. assumption is reasonable; adapt for time series (block bootstrap).
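A percentile-bootstrap sketch for a statistic without a simple SE formula (the median); assumes i.i.d. data, here a hypothetical skewed sample:

```python
import numpy as np

def bootstrap_ci_median(x, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median (assumes i.i.d. data)."""
    rng = np.random.default_rng(seed)
    medians = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                        for _ in range(n_boot)])
    return np.quantile(medians, [alpha / 2, 1 - alpha / 2])

x = np.random.default_rng(7).exponential(scale=2.0, size=200)  # skewed data
lo, hi = bootstrap_ci_median(x)
```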
139
How do you choose between t-tests, bootstrap, and permutation tests?
t-tests: efficient when assumptions roughly hold (approx normality of mean differences). Bootstrap: good for SE/CI of complex statistics (medians, AUC) when i.i.d. Permutation: strong for hypothesis tests of 'no difference' under exchangeability; fewer distributional assumptions. In ML, paired designs + permutation are often safer for metric differences.
140
If X~Binomial(n,p), what are E[p̂] and Var(p̂)? When is p̂ approx normal?
p̂ = X/n. E[p̂]=p. Var(p̂)=p(1−p)/n. By CLT/normal approximation, p̂ ≈ Normal(p, p(1−p)/n) when np and n(1−p) are large.
141
Define joint, marginal, and conditional distributions for (X,Y).
Joint: f_{X,Y}(x,y) (pmf or pdf). Marginal: f_X(x)=∫ f_{X,Y}(x,y) dy (or Σ_y for discrete). Conditional: f_{X|Y}(x|y)= f_{X,Y}(x,y)/f_Y(y) when f_Y(y)>0. Same ideas extend to higher dimensions.
142
How do you check if X and Y are independent using their joint distribution?
Compute marginals f_X and f_Y. If for all (x,y): f_{X,Y}(x,y) = f_X(x) f_Y(y), then X ⟂ Y. Equivalently, conditional equals marginal: f_{X|Y}(x|y)=f_X(x).
143
What is a multivariate normal (MVN) distribution and a key independence property?
X ∈ ℝ^d is MVN if every linear combination aᵀX is univariate normal. Notation: X ~ N(μ, Σ). Key: in MVN, uncorrelated components (Cov=0) are independent. More generally, block-diagonal Σ ⇒ block independence.
144
Give the conditional distribution of a partitioned MVN.
If [X;Y] ~ N([μ_X;μ_Y], [[Σ_XX, Σ_XY],[Σ_YX, Σ_YY]]), then X|Y=y ~ N( μ_X + Σ_XY Σ_YY^{-1}(y−μ_Y), Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX ). Used in Gaussian conditioning / Kalman filters / GP regression.
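The conditioning formula above translates directly to numpy; a sketch with a 2-D example where the answer is known in closed form:

```python
import numpy as np

def mvn_condition(mu, Sigma, idx_x, idx_y, y):
    """Mean and covariance of X | Y=y for a jointly Gaussian vector."""
    mu = np.asarray(mu, float)
    S = np.asarray(Sigma, float)
    mx, my = mu[idx_x], mu[idx_y]
    Sxx = S[np.ix_(idx_x, idx_x)]
    Sxy = S[np.ix_(idx_x, idx_y)]
    Syy = S[np.ix_(idx_y, idx_y)]
    K = Sxy @ np.linalg.inv(Syy)                  # "gain" Σ_XY Σ_YY^{-1}
    cond_mean = mx + K @ (np.asarray(y) - my)
    cond_cov = Sxx - K @ Sxy.T                    # Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX
    return cond_mean, cond_cov

# Bivariate normal with correlation 0.8:
# E[X|Y=1] = 0.8, Var(X|Y=1) = 1 - 0.8^2 = 0.36
m, C = mvn_condition([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], [0], [1], [1.0])
```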
145
How do you find the PDF of Y=g(X) for a continuous invertible transform?
If y=g(x) is 1–1 with inverse x=g^{-1}(y), then f_Y(y) = f_X(g^{-1}(y)) · |d/dy g^{-1}(y)|. In multivariate case, multiply by |det(Jacobian of inverse)|.
146
How do frequentists and Bayesians compute predictive probabilities for new data?
Frequentist: plug-in estimate θ̂ and compute p(x_new|θ̂); uncertainty handled via sampling distributions/intervals. Bayesian: integrate over posterior uncertainty: p(x_new|D)=∫ p(x_new|θ)p(θ|D)dθ. Bayesian predictive usually yields wider (more honest) uncertainty when data are limited.
147
What is E[X|Y] and how should you interpret it?
E[X|Y] is a random variable (a function of Y) giving the best mean-squared-error predictor of X given Y. Intuition: average of X among cases where Y is fixed. Key property: E[X|Y]=g(Y) for some function g.
148
State the tower property for conditional expectation.
E[X] = E[ E[X|Y] ]. More generally: E[ E[X|Y,Z] | Z ] = E[X|Z]. Think: averaging in stages gives the same overall average.
149
State the law of total variance and interpret it.
Var(X) = E[ Var(X|Y) ] + Var( E[X|Y] ). Interpretation: total variability = average within-group variability + variability of group means.
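A simulation sketch of the decomposition with a two-group mixture (group means/SDs are hypothetical; with equal group probabilities, total variance = 2.5 + 6.25 = 8.75):

```python
import numpy as np

rng = np.random.default_rng(3)
# Y picks a group with prob 1/2 each; X|Y=j ~ Normal(means[j], sds[j]^2)
means = np.array([0.0, 5.0])
sds = np.array([1.0, 2.0])
y = rng.integers(0, 2, size=200_000)
x = rng.normal(means[y], sds[y])

total_var = x.var()
# E[Var(X|Y)]: average within-group variance (groups are ~equal-sized)
within = np.mean([x[y == j].var() for j in (0, 1)])
# Var(E[X|Y]): variability of the group means
between = np.var([x[y == j].mean() for j in (0, 1)])
```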
150
State Markov's inequality and when it applies.
If X≥0 and a>0, then P(X ≥ a) ≤ E[X]/a. Useful for bounding tail probabilities with only the mean known.
151
State Chebyshev's inequality and interpret it.
For any RV with mean μ and variance σ²: P(|X−μ| ≥ kσ) ≤ 1/k². Gives a distribution-free concentration bound; often loose but very general.
152
State Jensen's inequality and why it's important.
If φ is convex, then φ(E[X]) ≤ E[φ(X)]. Used throughout statistics/ML (e.g., deriving ELBOs, proving log-likelihood concavity).
153
Write the chain rule for P(A1∩...∩An).
P(A1∩...∩An) = P(A1) P(A2|A1) P(A3|A1∩A2) ... P(An|A1∩...∩A_{n−1}). For random variables: f(x1,...,xn)=f(x1)f(x2|x1)...f(xn|x1..x_{n−1}).
154
Define conditional independence X ⟂ Y | Z.
X and Y are conditionally independent given Z if f_{X,Y|Z}(x,y|z)=f_{X|Z}(x|z) f_{Y|Z}(y|z) for all z. Equivalently: P(X∈A, Y∈B | Z)=P(X∈A|Z)P(Y∈B|Z).
155
Given data (xi,yi), how can you *assess* whether X and Y are independent?
1) Plot: scatter/heatmap; check for nonlinear patterns. 2) Correlation: tests only linear dependence (Pearson). Use Spearman/Kendall for monotonic. 3) Categorical: chi-square test of independence. 4) Continuous: mutual information estimates, HSIC, distance correlation; or compare f_{X,Y} to f_X f_Y via density estimates. 5) Model-based: fit conditional model Y|X and test if X adds predictive power. No finite-sample method can prove independence; you gather evidence under assumptions.
156
Define Shannon entropy for a discrete RV and interpret it.
H(X)= −Σ_x p(x) log p(x). Measures uncertainty / average information. Maximized by uniform distribution over a finite support.
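A tiny entropy sketch in base 2 (bits), using the convention 0·log 0 = 0:

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

h_uniform = entropy([0.25] * 4)                 # uniform over 4 outcomes: 2 bits
h_skewed = entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic: far less
```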
157
What is differential entropy and how does it differ from discrete entropy?
h(X)= −∫ f(x) log f(x) dx. Unlike discrete entropy, h(X) can be negative and is not invariant to reparameterization. Mutual information (difference of entropies) remains meaningful.
158
How is χ² distribution related to Gamma?
If Z ~ χ²_k, then Z ~ Gamma(shape= k/2, rate=1/2). Equivalently Gamma(shape=k/2, scale=2). This relationship helps compute moments and connect to exponential family.
159
How does Student's t arise from a Normal and a Chi-square RV?
If Z~N(0,1) and U~χ²_ν independent, then T = Z / √(U/ν) ~ t_ν. In sampling: replacing σ by s introduces the χ² term, leading to t.
160
Express an F distribution as a ratio of chi-square variables.
If U~χ²_{d1} and V~χ²_{d2} independent, then F = (U/d1)/(V/d2) ~ F_{d1,d2}.
161
Define a (homogeneous) Poisson process with rate λ.
A counting process N(t) with: 1) N(0)=0. 2) Independent increments. 3) Stationary increments: N(t+s)−N(s) ~ Poisson(λt). 4) For small h: P(1 event in h)≈λh and P(≥2 in h)=o(h). Then N(t) ~ Poisson(λt).
162
What is the distribution of interarrival times in a Poisson process?
Interarrival times are i.i.d. Exponential(λ). Waiting time to the k-th event is Gamma(k, rate=λ). These facts connect Poisson counts to exponential/gamma waiting times.
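A simulation sketch of this connection: summing Exponential(λ) gaps gives arrival times, and the count up to time T behaves like Poisson(λT) (rate and horizon are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 2.0, 1000.0

# Interarrival gaps ~ Exponential(lam); arrival times are their cumulative sums
gaps = rng.exponential(scale=1 / lam, size=int(3 * lam * T))
arrivals = np.cumsum(gaps)

# N(T): number of arrivals before time T; should be ~ Poisson(lam * T) = Poisson(2000)
n_T = int(np.searchsorted(arrivals, T))
```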
163
Compare mean and median. When is median preferred?
Mean uses all values and is sensitive to outliers. Median is the 50th percentile and is robust to extreme values. Median preferred for skewed/heavy-tailed distributions (income, response times) or when outliers are common.
164
Why does the unbiased sample variance use (n−1) instead of n?
s² = (1/(n−1)) Σ (x_i−x̄)² is unbiased for σ² when data are i.i.d. with finite variance. Using x̄ (estimated from data) 'consumes' one degree of freedom: the residuals x_i−x̄ satisfy one linear constraint (they sum to 0), so only n−1 are free. Dividing by n would underestimate σ² on average.
165
What is a z-score and why standardize?
z = (x−mean)/SD. Standardization puts variables on a common scale (mean 0, SD 1), aiding comparison, numerical stability, and interpretation (especially in ML).
166
List common pitfalls when interpreting correlation.
Correlation ≠ causation. Confounding variables can induce correlation. Nonlinear relationships may have near-zero correlation. Outliers can dominate correlation. Restriction of range can reduce correlation. Simpson’s paradox: aggregated data can reverse trends seen in subgroups.
167
What does the Mann–Whitney U test compare? When use it?
Compares two independent samples without assuming normality. Tests whether one distribution tends to produce larger values than the other (often interpreted as a shift in location). Use for skewed data, ordinal outcomes, or when outliers/non-normality make t-test unreliable.
168
What does the Wilcoxon signed-rank test do? When use it?
Nonparametric alternative to paired t-test. Tests whether the median of paired differences is 0 (with symmetry assumptions). Useful when paired differences are not approximately normal.
169
What does the KS test test, and what is its main use-case?
Compares a sample CDF to a reference CDF (one-sample) or compares two sample CDFs (two-sample). Test statistic: max absolute CDF difference. Sensitive to differences in distribution shape; works for continuous distributions (ties complicate).
170
What is an A/A test and why run it in experimentation?
An A/A test splits traffic into two identical variants. Used to validate instrumentation, randomization, and false positive rates. If A/A shows systematic differences or too many 'significant' results, your experiment pipeline may be biased.
171
What inputs do you need to compute power or required sample size?
1) Significance level α. 2) Desired power 1−β. 3) Effect size of practical interest (difference in means/proportions). 4) Variability (σ) or baseline rate p. 5) Test type (two-sided vs one-sided) and design (paired vs independent). Then solve for n (often numerically).
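The ingredients above combine into the classic approximate formula for a two-sample z-test of means, n per group = 2((z_{α/2}+z_β)σ/Δ)²; a sketch (effect size and σ are hypothetical):

```python
import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample z-test of means."""
    z_a = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_b = norm.ppf(power)           # quantile for desired power
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Detect a half-SD difference (delta=0.5, sigma=1) at alpha=0.05, power=0.8
n = n_per_group(delta=0.5, sigma=1.0)
```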
172
Define Cohen's d for difference in means and why report it.
d = (x̄1−x̄2)/s_pooled (or use s from a relevant reference). It’s a standardized effect size, making results comparable across scales. Always interpret in context; 'small/medium/large' rules are domain dependent.
173
Given a research question about means, how do you choose between z, t, Welch, paired t, or nonparametric?
1) One sample vs two samples vs paired. 2) If σ known and normal/large n: one-sample z. 3) If σ unknown: use t. 4) Two independent groups: Welch t (default). Pooled t only if equal variances justified. 5) Matched pairs: paired t on differences. 6) Strong non-normality/outliers/small n: consider rank tests or permutation/bootstrap. Always report effect size + CI and check assumptions (independence, measurement).
174
How do you choose between binomial test, z test for proportions, chi-square, and Fisher's exact?
1) Single proportion vs two proportions vs multi-category. 2) Small n or extreme p: exact binomial test. 3) Large n: z tests/CI for one or two proportions. 4) Multi-category GOF: chi-square GOF. 5) Contingency independence: chi-square independence; if low expected counts (esp 2×2), use Fisher exact. Assumptions: independent observations, correct expected counts model.
175
Given multinomial counts x1..xk with total n, what are the MLE and Dirichlet MAP for p?
MLE: p̂_i = x_i / n. With Dirichlet prior p~Dir(α1..αk), posterior Dir(α_i + x_i). MAP (all α_i>1): p_MAP,i = (α_i + x_i − 1) / (Σ_j α_j + n − k).
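A direct sketch of both estimators (counts and prior are hypothetical); note how the MAP keeps the zero-count category away from 0:

```python
import numpy as np

x = np.array([3, 7, 0])              # observed counts, n = 10
alpha = np.array([2.0, 2.0, 2.0])    # Dirichlet prior, all alpha_i > 1 so the MAP formula applies
n, k = x.sum(), x.size

p_mle = x / n
p_map = (alpha + x - 1) / (alpha.sum() + n - k)   # = [4/13, 8/13, 1/13]
```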
176
If x_i ~ Normal(μ, σ²) with known σ² and prior μ~Normal(μ0, τ²), what is posterior for μ?
Posterior μ|x is Normal with: Precision additivity: 1/τ_post² = 1/τ² + n/σ². Mean: μ_post = τ_post²( μ0/τ² + n x̄ / σ² ). MAP = posterior mean (Normal is symmetric/unimodal).
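A sketch of the precision-additivity update (data and prior values are hypothetical; with a weak prior the posterior mean stays near x̄):

```python
import numpy as np

def normal_posterior(x, sigma2, mu0, tau2):
    """Posterior for mu given Normal data with known sigma^2 and Normal(mu0, tau2) prior."""
    n = len(x)
    prec_post = 1 / tau2 + n / sigma2               # precisions add
    var_post = 1 / prec_post
    mean_post = var_post * (mu0 / tau2 + n * np.mean(x) / sigma2)
    return mean_post, var_post

x = [4.8, 5.1, 5.3, 4.9]
m, v = normal_posterior(x, sigma2=1.0, mu0=0.0, tau2=100.0)  # weak prior
```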
177
For i.i.d. Uniform(0, θ), what is the MLE of θ? Is it unbiased?
Likelihood is constant for θ ≥ max(x_i) and 0 otherwise. MLE: θ̂ = max(x_i). It is biased downward: E[max] = n/(n+1) θ. Unbiased estimator: (n+1)/n · max(x_i).
178
Do Gamma(α,β) MLEs have closed-form? What is typically done?
If both α and β unknown, MLEs usually have no closed-form for α. Common approach: set β̂ = α̂/x̄ (rate parameterization), then solve for α̂ numerically from the digamma equation log α̂ − ψ(α̂) = log x̄ − (1/n)Σ log x_i. In practice: use numerical optimization or method of moments as initialization.
179
If p~Beta(α,β) and you observe S successes, F failures, what is predictive P(next trial is success)?
Posterior is Beta(α+S, β+F). Predictive success probability for next Bernoulli is the posterior mean: P(X_new=1|data)= (α+S)/(α+β+S+F). This is 'add-α, add-β' smoothing (Laplace smoothing when α=β=1).
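The predictive rule is one line of code; with α=β=1 it is exactly Laplace's rule of succession:

```python
def beta_predictive(S, F, alpha=1.0, beta=1.0):
    """P(next trial = success | S successes, F failures) under a Beta(alpha, beta) prior."""
    return (alpha + S) / (alpha + beta + S + F)

# Laplace smoothing: 0 successes in 3 trials still gives a nonzero estimate
p = beta_predictive(S=0, F=3)   # (1+0)/(1+1+0+3) = 1/5
```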
180
With prior Dirichlet(α1..αk) and counts x1..xk, what is predictive probability of next category i?
Posterior Dir(α_i + x_i). Predictive: P(next=i | data)= (α_i + x_i) / (Σ_j α_j + n). This yields additive smoothing; prevents zero probabilities when α_i>0.
181
What is the difference between likelihood L(θ) and probability P(data|θ)?
P(data|θ) is a probability as a function of data for fixed θ. Likelihood L(θ)=P(data|θ) viewed as a function of θ for fixed observed data. Likelihood values don't have to sum/integrate to 1 over θ; only relative values matter for MLE/MAP.
182
How are two-sided hypothesis tests linked to confidence intervals?
For many parameters, a two-sided α-level test rejects H0:θ=θ0 iff θ0 is outside the (1−α) confidence interval. Example: reject μ=μ0 at α=0.05 iff μ0 not in the 95% CI for μ.
183
List the most common assumptions behind classical z/t/χ²/F tests.
1) Random sampling / independence (often the most important). 2) Correct model for sampling distribution (normality for exact t/F, large-sample approximations for z/χ²). 3) For pooled t / ANOVA: equal variances (homoscedasticity). 4) For χ²: expected cell counts not too small. Violations can inflate Type I error or reduce power.
184
Why do residual diagnostics matter in regression? What do you check?
Classical inference relies on error assumptions (linearity, constant variance, independence, sometimes normality). Check: residual vs fitted (nonlinearity/heteroscedasticity), QQ plot (normality), leverage/influence (outliers), autocorrelation (time series). In ML, diagnostics help decide transformations, robust methods, or different models.
185
What is heteroscedasticity and how do robust standard errors help?
Heteroscedasticity: Var(ε|X) not constant. OLS coefficients remain unbiased (under exogeneity) but usual SEs and t/F inference can be wrong. Robust (sandwich) SEs estimate variance without assuming constant error variance.
186
If Y = exp(X) and X~Normal(μ,σ²), what is Y? Give mean and median.
Y is Lognormal(μ,σ²). Median: exp(μ). Mean: exp(μ + σ²/2). Used for positive quantities formed by multiplicative effects.
187
How do indicator variables help compute expectations?
Let I_A be 1 if event A occurs else 0. Then E[I_A]=P(A). For counts, write the count as sum of indicators and use linearity: E[Σ I_i]=Σ E[I_i]. Great for expected # of successes, collisions, etc., even with dependence.
188
What is the birthday problem and what principle does it illustrate?
Question: among n people, what's P(at least two share a birthday)? Illustrates complement rule and dependence between pair events. Compute P(no match) = 365/365 × 364/365 × ... × (365−n+1)/365; then 1−that.
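The product above is easy to compute directly; for n = 23 the match probability famously crosses 1/2:

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), via the complement rule."""
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days   # i-th person avoids all earlier birthdays
    return 1 - p_no_match

p23 = p_shared_birthday(23)   # just over 0.5
```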
189
What does the Monty Hall problem teach about conditional probability?
Host's action depends on your initial choice (not independent). Condition on the host revealing a goat: switching wins with probability 2/3. Lesson: update probabilities using conditional information and the procedure generating the evidence.
190
What is Simpson’s paradox and why does it matter in statistics/ML?
A trend appearing in several groups can reverse when groups are combined. Caused by confounding / different group sizes. Lesson: stratify by important variables; be cautious interpreting aggregated correlations or success rates.
191
When can posterior mean and MAP differ a lot?
If posterior is skewed or multimodal. MAP picks the highest-density point (mode) and can sit on a boundary. Posterior mean averages over uncertainty and can lie in low-density regions for multimodal posteriors. Choose estimator based on loss function (mean for squared error, median for absolute error, MAP for 0–1 loss on a grid).