Probability and Statistics Refresher Flashcards

(191 cards)

1
Q

What does ‘probability’ mean? Contrast frequentist and Bayesian interpretations.

A

Frequentist: P(A) is the long-run relative frequency of event A in repeated identical trials.
Bayesian: P(A) is a degree of belief (a rational measure of uncertainty) given information.
Both use the same probability calculus (axioms + rules); they differ in interpretation and how parameters are treated (fixed vs random).

2
Q

Define sample space Ω, event A, and outcome ω. Give an example.

A

Sample space Ω: set of all possible outcomes ω of an experiment.
Event A: subset of Ω.
Example: roll a die. Ω={1,2,3,4,5,6}. Event A='even'={2,4,6}. Outcome ω might be 5.

3
Q

State the 3 Kolmogorov axioms of probability.

A

1) Non-negativity: for any event A, P(A) ≥ 0.
2) Normalization: P(Ω) = 1.
3) Additivity: for disjoint events A∩B=∅, P(A∪B)=P(A)+P(B). (Countable additivity in full generality.)

4
Q

Derive the complement rule: P(Aᶜ)=1−P(A).

A

Because A and Aᶜ are disjoint and A∪Aᶜ=Ω.
So 1=P(Ω)=P(A∪Aᶜ)=P(A)+P(Aᶜ) ⇒ P(Aᶜ)=1−P(A).

5
Q

Write P(A∪B) in terms of P(A), P(B), and P(A∩B).

A

Inclusion–exclusion for 2 events:
P(A∪B)=P(A)+P(B)−P(A∩B).
Special case: if A,B disjoint then P(A∩B)=0 and you recover additivity.

6
Q

State inclusion–exclusion for P(A∪B∪C).

A

P(A∪B∪C)=P(A)+P(B)+P(C)
−P(A∩B)−P(A∩C)−P(B∩C)
+P(A∩B∩C).

7
Q

Define conditional probability P(A|B). When is it defined?

A

If P(B)>0, then P(A|B) = P(A∩B)/P(B).
Interpretation: restrict the sample space to B and renormalize.

8
Q

Express P(A∩B) using P(A|B) and P(B) (and the symmetric form).

A

P(A∩B)=P(A|B)P(B)=P(B|A)P(A), assuming the conditioning event has positive probability.

9
Q

State the law of total probability for a partition {B_i}.

A

If {B_i} are disjoint, cover Ω, and P(B_i)>0, then for any event A:
P(A)=Σ_i P(A|B_i)P(B_i).
Think: break A into pieces inside each B_i.

10
Q

State Bayes’ theorem and interpret each term.

A

Bayes: P(A|B)= P(B|A)P(A) / P(B), with P(B)>0.
P(A)=prior, P(B|A)=likelihood, P(B)=evidence/normalizer, P(A|B)=posterior.
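The update is easy to check numerically. A short Python sketch (the base rate, sensitivity, and false-positive rate below are made-up illustration numbers):

```python
# Bayes' theorem on a hypothetical diagnostic test:
# prior P(disease)=0.01, sensitivity P(+|disease)=0.95,
# false-positive rate P(+|healthy)=0.05.
prior = 0.01
sens = 0.95   # likelihood P(B|A)
fpr = 0.05    # likelihood under the complement, P(B|Aᶜ)

# evidence P(+) via the law of total probability
evidence = sens * prior + fpr * (1 - prior)

# posterior P(disease | +)
posterior = sens * prior / evidence
print(round(posterior, 4))  # roughly 0.161
```

The low posterior despite an accurate-looking test is the classic base-rate effect: the prior pulls hard when the event is rare.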

11
Q

Write Bayes’ theorem in odds form for hypotheses H1 vs H0.

A

Posterior odds = prior odds × Bayes factor.
P(H1|D)/P(H0|D) = [P(D|H1)/P(D|H0)] × [P(H1)/P(H0)].
The Bayes factor measures evidence from data D.

12
Q

What is n! and when does it appear in counting?

A

n! = n·(n−1)·…·2·1 (with 0!=1).
Counts the number of ways to order n distinct objects (permutations of length n).

13
Q

How many ways to arrange r objects chosen from n distinct objects (no repetition)? Why does order matter?

A

Permutations: P(n,r)= n·(n−1)·…·(n−r+1)= n!/(n−r)!.
Order matters because different sequences correspond to different outcomes (e.g., gold/silver/bronze).

14
Q

How many ways to choose r objects from n distinct objects (no repetition) when order does not matter? Why not?

A

Combinations: C(n,r)= n choose r = n!/[r!(n−r)!].
Order doesn’t matter because selections are sets: {a,b}={b,a}.
You can derive it by dividing permutations by r! (all r! orders represent the same set).
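Python's math module exposes both counts directly; a quick check of the divide-by-r! derivation (n=10, r=3 chosen arbitrarily):

```python
import math

n, r = 10, 3

# permutations: ordered arrangements of r objects from n
perms = math.perm(n, r)   # 10*9*8 = 720

# combinations: unordered selections; divide out the r! orderings
combs = math.comb(n, r)   # 720 / 3! = 120

assert combs == perms // math.factorial(r)
print(perms, combs)  # 720 120
```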

15
Q

How many distinct permutations of n items with counts n1,n2,…,nk (sum=n)?

A

Multiset permutations: n!/(n1! n2! … nk!).
Reason: start with n! orders, but swapping identical items doesn’t create a new arrangement; divide by each group’s factorial.

16
Q

How many ways to choose r items from n types with repetition allowed (order irrelevant)?

A

Stars and bars: number of nonnegative integer solutions to x1+…+xn=r is C(n+r−1, r).
Interpret r stars and n−1 bars as separators between item types.
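The formula can be verified by brute force for small cases; a sketch (n=4 types and r=6 items are arbitrary choices):

```python
import math
from itertools import product

def count_solutions(n_types, r):
    """Brute-force count of nonnegative integer solutions x1+...+xn = r."""
    return sum(1 for xs in product(range(r + 1), repeat=n_types)
               if sum(xs) == r)

n, r = 4, 6
assert count_solutions(n, r) == math.comb(n + r - 1, r)
print(math.comb(n + r - 1, r))  # C(9, 6) = 84
```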

17
Q

State the binomial theorem and connect it to combinations.

A

(a+b)^n = Σ_{k=0}^n C(n,k) a^{n−k} b^k.
C(n,k) counts ways to choose which k of the n factors contribute a ‘b’ term (order of factors irrelevant).

18
Q

Use inclusion–exclusion to count |A∪B| and explain why subtraction is needed.

A

|A∪B|=|A|+|B|−|A∩B|.
Adding |A| and |B| double-counts elements in both sets; subtract once to correct.

19
Q

What is a random variable (RV)? Discrete vs continuous?

A

An RV X is a function from outcomes ω∈Ω to real numbers: X(ω)∈ℝ.
Discrete: takes countable values with PMF p(x)=P(X=x).
Continuous: takes values on intervals; probabilities come from integrals of a PDF f(x), with P(X=x)=0.

20
Q

Define the CDF F_X(x). List key properties.

A

F_X(x)=P(X≤x).
Properties: non-decreasing; right-continuous; limits: F(−∞)=0, F(∞)=1.
For any a<b: P(a< X ≤ b)=F(b)−F(a).

21
Q

Explain PMF vs PDF and how probabilities are computed in each case.

A

Discrete: PMF p(x)=P(X=x). For set S: P(X∈S)=Σ_{x∈S} p(x).
Continuous: PDF f(x)≥0 with ∫ f(x) dx =1 and P(a≤X≤b)=∫_a^b f(x) dx.
PDF is not a probability at a point; it is a density.

22
Q

How are PDF and CDF related for a continuous RV?

A

F(x)=∫_{−∞}^x f(t)dt.
If F is differentiable, f(x)=F′(x).
Probabilities come from area: P(a<X≤b)=F(b)−F(a)=∫_a^b f(t)dt.

23
Q

How does the CDF look for a discrete RV? What do the jumps mean?

A

A discrete CDF is a step function.
Jump size at x equals P(X=x).
Formally: P(X=x)=F(x)−lim_{t→x^-}F(t).

24
Q

Define survival function S(x) and hazard h(x). When are they used?

A

Survival: S(x)=P(X>x)=1−F(x).
Hazard (continuous): h(x)=f(x)/S(x).
Used in time-to-event / reliability / survival analysis; exponential has constant hazard.

25
Define E[X] for discrete and continuous X.
Discrete: E[X]=Σ_x x·p(x). Continuous: E[X]=∫_{−∞}^{∞} x·f(x) dx (if integral exists).
26
State LOTUS: how to compute E[g(X)] without finding distribution of g(X).
Discrete: E[g(X)]=Σ_x g(x)p(x). Continuous: E[g(X)]=∫ g(x)f(x)dx. You don't need the distribution of Y=g(X) to compute its expectation.
27
Define Var(X) and SD(X). Give the computational shortcut.
Var(X)=E[(X−E[X])^2]. SD(X)=√Var(X). Shortcut: Var(X)=E[X^2]−(E[X])^2.
28
State linearity of expectation. Does it require independence?
E[aX+bY+c]=aE[X]+bE[Y]+c. No independence is required. Linearity always holds when expectations exist.
29
Give Var(X+Y) in terms of Var and Cov. When does it simplify?
Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y). If X,Y independent (or just uncorrelated, Cov=0), then Var(X+Y)=Var(X)+Var(Y).
30
Define Cov(X,Y) and Corr(X,Y). What do they measure?
Cov(X,Y)=E[(X−E[X])(Y−E[Y])]=E[XY]−E[X]E[Y]. Correlation ρ=Cov(X,Y)/(σ_X σ_Y) in [−1,1]. They measure linear association (not general dependence).
31
Define independence for events and for random variables.
Events: A,B independent if P(A∩B)=P(A)P(B) (equiv. P(A|B)=P(A)). RVs X,Y independent if for all measurable sets A,B: P(X∈A, Y∈B)=P(X∈A)P(Y∈B). Equivalent: joint pdf/pmf factorizes: f_{X,Y}(x,y)=f_X(x)f_Y(y).
32
Is Cov(X,Y)=0 enough to conclude independence? When is it enough?
In general, no: uncorrelated does not imply independent. Exception: if (X,Y) are jointly normal (multivariate normal), then Cov(X,Y)=0 implies independence. Counterexample idea: let X~Uniform(−1,1), Y=X^2. Then Cov(X,Y)=0 but dependence is strong.
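The counterexample is easy to simulate (sample size and seed are arbitrary):

```python
import random

random.seed(0)
n = 100_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y = X^2 is a deterministic function of X

def cov(a, b):
    """Population-style sample covariance (divide by n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

print(cov(xs, ys))                     # near 0: X and Y are uncorrelated
print(cov([abs(x) for x in xs], ys))   # clearly positive: they are dependent
```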
33
Give a practical checklist for computing a probability in problems.
1) Define the experiment and Ω clearly. 2) Define the event A precisely (often as a set or inequality). 3) Choose a tool: counting (equally likely), PMF/PDF/CDF, conditioning, complements, or total probability. 4) Compute carefully; sanity-check: result in [0,1], compare to bounds, consider extreme cases.
34
If all outcomes in Ω are equally likely, how do you compute P(A)?
P(A)=|A|/|Ω|. Key: 'equally likely' is an assumption about the model; justify it (symmetry, randomization mechanism).
35
When is it easier to compute P(Aᶜ) instead of P(A)? Give typical patterns.
Use complements when A is complicated but its complement is simple. Common patterns: 'at least one' vs 'none'; 'at most' vs 'more than'; unions of many overlapping events. Compute P(A)=1−P(Aᶜ).
36
When should you condition on an event or variable to compute a probability?
Condition when the problem has hidden structure: different cases, sequential stages, or dependence. Pick B that makes A easier under each case, then use P(A)=Σ P(A|B_i)P(B_i) (total probability).
37
What real-world 'story' does the Bernoulli distribution model?
Single trial success/failure (coin flip, click/no click).
38
What real-world 'story' does the Binomial distribution model?
# successes in n independent Bernoulli trials (A/B conversions in n users).
39
What real-world 'story' does the Geometric distribution model?
# trials until first success (how many customers until first purchase).
40
What real-world 'story' does the Negative Binomial distribution model?
# trials (or failures) until r successes (how many requests until r errors).
41
What real-world 'story' does the Hypergeometric distribution model?
# successes in n draws WITHOUT replacement (quality control sampling from a finite lot).
42
What real-world 'story' does the Poisson distribution model?
# events in time/space for rare independent events with constant rate (arrivals, defects).
43
What real-world 'story' does the Exponential distribution model?
Waiting time between Poisson events; memoryless lifetime model.
44
What real-world 'story' does the Gamma distribution model?
Waiting time until k-th Poisson event; sum of exponentials.
45
What real-world 'story' does the Normal distribution model?
Sums/averages of many small effects; measurement noise; CLT.
46
What real-world 'story' does the Lognormal distribution model?
Products of many positive factors; income/size distributions.
47
What real-world 'story' does the Uniform distribution model?
All values equally likely on an interval; naive noninformative model.
48
What real-world 'story' does the Beta distribution model?
Distribution over probabilities p in [0,1]; prior/posterior for Bernoulli/Binomial.
49
What real-world 'story' does the Dirichlet distribution model?
Distribution over probability vectors; prior/posterior for Multinomial.
50
Write Bernoulli(p) PMF, mean, and variance.
X~Bernoulli(p) takes values {0,1}. PMF: P(X=1)=p, P(X=0)=1−p. E[X]=p. Var(X)=p(1−p).
51
Given data x1,...,xn ∈ {0,1} from Bernoulli(p), what are MLE and MAP under Beta(α,β) prior?
Let S=Σ xi. Log-likelihood: ℓ(p)=S log p + (n−S) log(1−p). MLE: p̂ = S/n (sample mean). If prior p~Beta(α,β), posterior is Beta(α+S, β+n−S). MAP (α,β>1): p_MAP = (α+S−1)/(α+β+n−2).
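A worked sketch of these formulas (the 10 observations and the Beta(2,2) prior are made up):

```python
# Hypothetical data: 7 successes in 10 Bernoulli trials, Beta(2, 2) prior.
data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
n, S = len(data), sum(data)
alpha, beta = 2.0, 2.0

p_mle = S / n                                  # sample mean
post_a, post_b = alpha + S, beta + (n - S)     # posterior Beta(α+S, β+n−S)
p_map = (post_a - 1) / (post_a + post_b - 2)   # posterior mode (needs α, β > 1)

print(p_mle, p_map)  # 0.7 and 8/12 ≈ 0.667: the prior shrinks the estimate
```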
52
Write Binomial(n,p) PMF, mean, and variance.
X~Binomial(n,p) = # successes in n independent Bernoulli(p). PMF: P(X=k)=C(n,k) p^k (1−p)^{n−k}, k=0..n. E[X]=np. Var(X)=np(1−p).
53
For Binomial(n,p) with known n and observed k (or multiple samples), what are p MLE and Beta MAP?
Single observation k: likelihood ∝ p^k (1−p)^{n−k}. MLE: p̂ = k/n. With multiple Binomial samples (n_i,k_i): p̂ = (Σ k_i)/(Σ n_i). With Beta(α,β) prior: posterior Beta(α+Σk_i, β+Σ(n_i−k_i)). MAP: (α+Σk_i−1)/(α+β+Σn_i−2) (when α,β>1).
54
What are the two common parameterizations of the Geometric(p) distribution?
Convention A (trials until first success): support {1,2,...}, P(X=k)=(1−p)^{k−1}p. Convention B (failures before first success): support {0,1,2,...}, P(X=k)=(1−p)^k p. Always check which convention a source/problem uses.
55
Give mean/variance of Geometric(p) and state the memoryless property.
Using support {1,2,...}: E[X]=1/p, Var(X)=(1−p)/p^2. Memoryless: P(X>m+n | X>m)=P(X>n). (Same idea as exponential.)
56
Write the negative binomial PMF for number of trials until r successes.
Let X be the trial count when the r-th success occurs (support {r,r+1,...}). P(X=k)=C(k−1, r−1) p^r (1−p)^{k−r}. Reason: among first k−1 trials, choose positions of r−1 successes; k-th trial is success.
57
Write Hypergeometric(N, K, n) PMF and explain the setup.
Finite population size N with K successes (and N−K failures). Draw n items *without replacement*. X=# successes in sample. P(X=k)= [C(K,k) C(N−K, n−k)] / C(N,n). Support: max(0, n−(N−K)) ≤ k ≤ min(n,K).
58
Write Poisson(λ) PMF, mean, and variance.
X~Poisson(λ): P(X=k)=e^{−λ} λ^k/k!, k=0,1,2,... E[X]=λ. Var(X)=λ. Often models counts in a fixed interval when events occur with constant rate and independence assumptions.
59
Given i.i.d. data x1,...,xn ~ Poisson(λ), find MLE and MAP with Gamma(α,β) prior (rate β).
Let S=Σ xi. Log-likelihood: ℓ(λ)= S log λ − nλ + const. MLE: λ̂ = S/n (sample mean). Prior λ~Gamma(α,β) with pdf ∝ λ^{α−1} e^{−βλ}. Posterior: Gamma(α+S, β+n). MAP (α+S>1): (α+S−1)/(β+n).
60
Write Uniform(a,b) PDF, mean, and variance.
f(x)=1/(b−a) for a≤x≤b else 0. E[X]=(a+b)/2. Var(X)=(b−a)^2/12.
61
Write Exponential(λ) PDF and CDF, and state memorylessness.
Support x≥0. PDF: f(x)=λ e^{−λx}. CDF: F(x)=1−e^{−λx}. Memoryless: P(X>s+t | X>s)=P(X>t).
62
Given i.i.d. x1,...,xn ~ Exponential(λ), find MLE and MAP with Gamma(α,β) prior on λ.
Likelihood: L(λ)=λ^n exp(−λ Σx_i). MLE: λ̂ = n / Σx_i = 1 / x̄. With prior λ~Gamma(α,β) (rate β): posterior Gamma(α+n, β+Σx_i). MAP: (α+n−1)/(β+Σx_i) (requires α+n>1).
63
Define Gamma distribution (shape α, rate β). Give mean and variance.
Support x≥0. PDF: f(x)= β^α/Γ(α) · x^{α−1} e^{−βx}. Mean: α/β. Variance: α/β^2. Special cases: α=1 gives Exponential(β); the sum of α i.i.d. Exponential(β) variables (integer α) is Gamma(α,β).
64
Write Normal(μ,σ²) PDF, mean, and variance.
PDF: f(x)= (1/√(2πσ²)) exp(−(x−μ)²/(2σ²)). E[X]=μ, Var(X)=σ².
65
Given i.i.d. x1,...,xn ~ Normal(μ,σ²), what are the MLEs of μ and σ²?
MLE μ̂ = x̄. MLE σ̂² = (1/n) Σ (x_i−x̄)². (Note: unbiased sample variance uses 1/(n−1), not MLE.)
66
Write Beta(α,β) PDF and its mean and variance.
Support p∈[0,1]. PDF: f(p)= (1/B(α,β)) p^{α−1}(1−p)^{β−1}. Mean: α/(α+β). Var: αβ/[(α+β)²(α+β+1)].
67
Show why Beta is conjugate to Bernoulli/Binomial.
If p~Beta(α,β) and data are Bernoulli/Binomial with S successes and F failures: Posterior p|data ~ Beta(α+S, β+F). Reason: likelihood contributes p^S(1−p)^F; multiplying by prior keeps the same functional form.
68
Write Multinomial(n, p1..pk) PMF and interpret parameters.
Counts (X1,...,Xk) with Σ Xi = n and probabilities p_i (Σ p_i=1). P(X1=x1,...,Xk=xk)= n!/(∏ x_i!) ∏ p_i^{x_i}. Models n independent draws into k categories.
69
State Dirichlet(α1..αk) and why it is conjugate to Multinomial.
Dirichlet density on p=(p1..pk): f(p) ∝ ∏ p_i^{α_i−1} on simplex. If p~Dir(α) and counts x from Multinomial, posterior is Dir(α_i + x_i). Interpret α_i as 'pseudo-counts'.
70
State the Poisson limit of the Binomial and the conditions.
If X~Binomial(n,p) with n large, p small, and λ=np fixed, then X ≈ Poisson(λ). Intuition: many trials, rare success, constant expected count.
71
When can Binomial(n,p) be approximated by Normal? Include continuity correction.
If np and n(1−p) are both 'large' (common rule: ≥5 or ≥10), then X ≈ Normal(np, np(1−p)). Continuity correction: P(X≤k) ≈ Φ((k+0.5−np)/√(np(1−p))).
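The benefit of the correction can be checked against the exact binomial CDF; a standard-library sketch (n=50, p=0.3, k=12 are arbitrary but satisfy the rule of thumb):

```python
import math

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p, k = 50, 0.3, 12
mu, sd = n * p, math.sqrt(n * p * (1 - p))

exact = binom_cdf(k, n, p)
approx = normal_cdf((k + 0.5 - mu) / sd)   # with continuity correction (+0.5)
crude = normal_cdf((k - mu) / sd)          # without correction

print(exact, approx, crude)  # corrected value tracks the exact CDF more closely
```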
72
When can Poisson(λ) be approximated by Normal?
For large λ (rule of thumb: λ≥10 or 20), X~Poisson(λ) ≈ Normal(λ, λ). Use continuity correction if approximating discrete probabilities.
73
What distribution results from mixing a Poisson with a Gamma prior on its rate?
If X|λ ~ Poisson(λ) and λ ~ Gamma(α,β), then the marginal distribution of X is Negative Binomial. Interpretation: overdispersion—random rate λ inflates variance vs mean compared to pure Poisson.
74
Why does the Normal show up so often? Connect to sums and CLT.
If X_i are i.i.d. with finite mean μ and variance σ², then the standardized sum (ΣX_i − nμ)/(σ√n) converges in distribution to N(0,1) as n→∞ (CLT). So averages of many small effects become approximately normal.
75
Define a 'random sample' from a population. What makes sampling random?
A random sample usually means i.i.d. observations X1,...,Xn drawn according to the same population distribution. Randomness comes from a known randomization mechanism: each unit has a known chance of selection (e.g., simple random sampling) and measurements are not systematically biased. Key threats: selection bias, nonresponse, dependence (clustered sampling), measurement bias.
76
What is simple random sampling without replacement? What distribution does the count of successes follow?
SRS without replacement: choose n units uniformly from N without replacement. If population has K successes, the sample success count X follows Hypergeometric(N,K,n). If N is large relative to n, Hypergeometric ≈ Binomial(n, p=K/N).
77
If you sample with replacement from a finite population with success proportion p, what distribution does the # successes follow?
With replacement, draws are (approximately) independent. # successes in n draws follows Binomial(n,p) exactly (under the model).
78
What is the finite population correction and when is it used?
When sampling without replacement from a finite population, the variance of the sample mean is smaller. If sampling fraction f = n/N is not negligible (e.g., >5–10%), use FPC: √((N−n)/(N−1)). For estimating a mean: SE( x̄ ) ≈ (σ/√n)·√((N−n)/(N−1)).
79
State (informally) the law of large numbers and what it guarantees.
If X1,...,Xn are i.i.d. with mean μ, then the sample mean x̄ converges to μ as n grows. LLN is about convergence of averages (consistency), not about the distribution of x̄ for finite n.
80
State the CLT and what conditions matter most in practice.
For i.i.d. variables with finite mean μ and finite variance σ²: Z = (x̄ − μ)/(σ/√n) → N(0,1) in distribution as n→∞. In practice: independence (or weak dependence), not-too-heavy tails, and sufficient n. If variance is infinite (heavy tails), classical CLT may fail.
81
If Xi ~ Normal(μ,σ²) i.i.d., what is the distribution of the sample mean x̄?
Exactly: x̄ ~ Normal(μ, σ²/n). No approximation needed because linear combinations of normals are normal.
82
Define the one-sample t-statistic and state its distribution under normality.
If Xi ~ Normal(μ,σ²) and σ² unknown, define sample mean x̄ and sample SD s. t = (x̄ − μ)/(s/√n) follows Student t with df = n−1. This is why t-intervals/tests are exact for normal populations.
83
What is the distribution of (n−1)s²/σ² under normality?
If Xi ~ Normal(μ,σ²), then (n−1)s²/σ² ~ χ²_{n−1}. Used for confidence intervals and tests about σ².
84
How does the F distribution arise from chi-square variables?
If U~χ²_{d1} and V~χ²_{d2} are independent, then (U/d1)/(V/d2) ~ F_{d1,d2}. In normal samples, ratios of sample variances lead to F tests.
85
Define mean, variance, skewness, and kurtosis in terms of moments.
Mean μ=E[X]. Variance σ²=E[(X−μ)²]. Skewness γ1 = E[(X−μ)³]/σ³ (asymmetry). Excess kurtosis γ2 = E[(X−μ)⁴]/σ⁴ − 3 (tail heaviness vs normal).
86
Define the p-quantile and connect it to the CDF.
The p-quantile q_p satisfies F(q_p) ≥ p and F(q_p−) ≤ p. For continuous strictly increasing F: q_p = F^{-1}(p). Median is q_0.5.
87
Define the moment-generating function (MGF). What is it used for?
MGF: M_X(t)=E[e^{tX}] (when finite near t=0). Derivatives at 0 give moments: M'(0)=E[X], M''(0)=E[X²], etc. MGFs help prove distributional identities and sums of independent RVs: M_{X+Y}=M_X·M_Y.
88
What is the characteristic function and why is it more general than the MGF?
φ_X(t)=E[e^{itX}]. Always exists (bounded by 1), unlike MGF. Uniquely determines distribution; products correspond to sums of independent RVs. Used in advanced limit theorems.
89
Give the Bayesian updating recipe in words and symbols.
Start with prior p(θ). Data D has likelihood p(D|θ). Posterior: p(θ|D) ∝ p(D|θ)p(θ). Evidence: p(D)=∫ p(D|θ)p(θ)dθ. Posterior predictive: p(x_new|D)=∫ p(x_new|θ)p(θ|D)dθ.
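The recipe can be run on a grid when no closed form is handy; a sketch for the Beta–Bernoulli case (8 successes, 2 failures, and the flat prior are made-up choices):

```python
# Grid approximation of Bayesian updating for a Bernoulli parameter p.
S, F = 8, 2
grid = [i / 1000 for i in range(1, 1000)]        # p values in (0, 1)

prior = [1.0 for _ in grid]                      # flat prior ~ Beta(1, 1)
like = [p**S * (1 - p)**F for p in grid]         # likelihood p(D|p)
unnorm = [l * pr for l, pr in zip(like, prior)]  # posterior ∝ likelihood × prior
Z = sum(unnorm)                                  # discrete stand-in for the evidence
post = [u / Z for u in unnorm]

post_mean = sum(p * w for p, w in zip(grid, post))
print(round(post_mean, 3))  # close to the exact Beta(9, 3) mean, 9/12 = 0.75
```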
90
Explain the roles of prior, likelihood, and posterior in Bayesian inference.
Prior: beliefs about parameter θ before data. Likelihood: how plausible data are under θ. Posterior: updated beliefs after seeing data; balances prior and likelihood. As n grows, likelihood often dominates (under regularity), but prior still matters with small data or weak signal.
91
What is a conjugate prior and why is it useful?
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. Benefits: closed-form posterior updates (fast, interpretable), easy predictive calculations. Examples: Beta–Bernoulli/Binomial, Gamma–Poisson/Exponential, Normal–Normal (known variance), Dirichlet–Multinomial.
92
Contrast a Bayesian credible interval with a frequentist confidence interval.
Credible interval: P(θ ∈ [a,b] | data)=0.95 (probability about parameter given data). Confidence interval: procedure that yields intervals that contain the true fixed parameter θ in 95% of repeated samples. CI is NOT '95% probability θ is in this interval' under strict frequentist interpretation.
93
In frequentist vs Bayesian analysis, what is treated as random?
Frequentist: parameter θ is fixed (unknown); data are random due to sampling. Bayesian: parameter θ is random with a prior; data are observed; uncertainty about θ is modeled probabilistically.
94
Define the maximum likelihood estimator (MLE).
Given data D and parameter θ, the likelihood is L(θ)=p(D|θ). MLE: θ̂_MLE = argmax_θ L(θ) = argmax_θ log L(θ). Interpretation: choose parameter value that makes observed data most probable under the model.
95
Define the maximum a posteriori (MAP) estimator and connect it to MLE.
MAP: θ̂_MAP = argmax_θ p(θ|D). Using Bayes: p(θ|D) ∝ p(D|θ)p(θ). So MAP maximizes log-likelihood + log-prior. If prior is flat (constant), MAP=MLE.
96
Explain how L2 and L1 regularization correspond to MAP in linear regression.
Linear regression with Gaussian noise: maximizing likelihood is minimizing SSE. Add L2 penalty (ridge): minimize SSE + λ||β||² → equivalent to MAP with Gaussian prior β~N(0, τ²I). Add L1 penalty (lasso): minimize SSE + λ||β||₁ → equivalent to MAP with Laplace (double-exponential) prior on β. Regularization encodes prior beliefs about coefficient size/sparsity.
97
State the bias–variance decomposition for squared error and why it matters.
For an estimator f̂(x), expected squared error at x: E[(f̂(x)−f(x))²] = Bias(f̂(x))² + Var(f̂(x)) + noise. Regularization can increase bias but reduce variance, often lowering total error.
98
What is a sufficient statistic? Give an example.
T(X) is sufficient for θ if the data provide no extra information about θ beyond T. Factorization theorem: p(x|θ)=g(T(x),θ)h(x). Example: for Bernoulli/Binomial, S=Σx_i is sufficient for p.
99
Write the standard linear regression model and assumptions.
Model: y = Xβ + ε. Common assumptions for classical inference: ε ~ N(0, σ²I), independent, homoscedastic. Then β̂_OLS = argmin ||y−Xβ||². Normality is mainly needed for exact t/F inference; OLS still minimizes SSE without it.
100
What is the closed-form OLS estimator β̂ (when it exists)?
If X has full column rank, β̂ = (XᵀX)^{-1} Xᵀ y. If not full rank, use pseudoinverse / regularization.
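For a single predictor plus intercept, the normal equations reduce to a 2×2 solve; a sketch on made-up data:

```python
# OLS via the normal equations for y = b0 + b1*x (hypothetical points).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]   # roughly y = 1 + 2x plus noise

n = len(xs)
# entries of XᵀX and Xᵀy for design matrix X = [[1, x_i]]
sx, sxx = sum(xs), sum(x * x for x in xs)
sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))

det = n * sxx - sx * sx          # nonzero iff the xs are not all equal
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det

print(b0, b1)  # intercept near 1, slope near 2
```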
101
How do you interpret a coefficient β_j in multiple linear regression?
β_j is the expected change in y for a 1-unit increase in x_j, holding other predictors fixed. Interpretation depends on model validity, scaling, interactions, and collinearity.
102
How do t-tests arise in linear regression for a single coefficient?
Under normal errors, each coefficient estimate β̂_j has a sampling distribution. Test H0: β_j=0 uses t = (β̂_j−0)/SE(β̂_j), which follows t with df = n−p. SE depends on σ̂² and (XᵀX)^{-1}.
103
What does an F-test do in regression?
Compares explained vs unexplained variance or compares nested models. H0: restricted model is true (some coefficients =0). F = ((RSS_restricted − RSS_full)/q) / (RSS_full/(n−p_full)). Under normal errors, F ~ F_{q, n−p_full}.
104
What are the common conjugate priors for Bayesian linear regression?
With Gaussian likelihood (ε normal): Often use β|σ² ~ Normal(β0, σ² V0) and σ² ~ Inverse-Gamma(a0,b0) (or Normal–Inverse-Gamma jointly). Posterior is also Normal–Inverse-Gamma with closed-form updates. Predictive distribution is Student-t.
105
What is a 95% confidence interval (CI) in the frequentist sense?
A CI is a random interval procedure I(D) computed from data D. A 95% CI satisfies: P_θ( θ ∈ I(D) ) = 0.95 over repeated samples from the model. After observing data, the interval is fixed; θ is fixed (frequentist).
106
Give the generic CI template using a point estimate and a standard error.
CI ≈ estimate ± (critical value) × (standard error). Critical value depends on desired confidence level and sampling distribution (z, t, etc.). SE depends on variance and sample design. This template comes from a pivotal quantity or asymptotic normality.
107
Define margin of error. What affects it?
MOE = (critical value) × SE. Higher confidence → larger critical value → larger MOE. Larger n → smaller SE (typically ∝1/√n) → smaller MOE. More variability (σ or p(1−p)) → larger SE → larger MOE.
108
Write the CI for a population mean when σ is known (or n large with known σ).
If x̄ ~ Normal(μ, σ²/n) and σ known: CI: x̄ ± z_{α/2} · (σ/√n). For 95%: z_{0.025}≈1.96.
109
Write the CI for μ when σ is unknown and data are (approximately) normal.
CI: x̄ ± t_{α/2, df=n−1} · (s/√n). Uses sample SD s and Student-t critical value. Exact if population is normal; robust-ish with moderate n by CLT.
110
Give the common large-sample CI for a proportion p (Wald). When does it fail?
Let p̂ = X/n. Approx CI: p̂ ± z_{α/2} √(p̂(1−p̂)/n). Fails when n small or p near 0/1 (coverage poor). Better alternatives: Wilson score, Agresti–Coull, exact (Clopper–Pearson).
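The failure mode is easy to see numerically; a sketch comparing Wald and Wilson on 1 success in 20 trials (made-up counts):

```python
import math

def wald_ci(x, n, z=1.96):
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def wilson_ci(x, n, z=1.96):
    p = x / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Near the boundary, Wald dips below 0; Wilson stays inside (0, 1).
print(wald_ci(1, 20))
print(wilson_ci(1, 20))
```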
111
How do you choose n to achieve a target margin of error E for a mean (σ known)?
Want MOE = z_{α/2} σ/√n ≤ E. Solve: n ≥ (z_{α/2} σ / E)². If σ unknown, use a pilot estimate or conservative bound.
112
How do you choose n to achieve target MOE E for a proportion?
MOE ≈ z_{α/2} √(p(1−p)/n) ≤ E ⇒ n ≥ z_{α/2}² p(1−p)/E². If p unknown, worst-case p=0.5 maximizes p(1−p)=0.25 ⇒ n ≥ z²·0.25/E².
113
What happens to CI width when you change confidence level (e.g., 90%→99%)?
Higher confidence uses a larger critical value (z or t), so the interval becomes wider. Tradeoff: more confidence about containing θ, but less precision (wider range).
114
If you don't know the population standard deviation σ, what do you do for inference on μ?
Use sample SD s as an estimator. If population is normal (or n moderate/large), use t-based methods: t = (x̄−μ)/(s/√n). For non-normal small n, consider bootstrap CIs or nonparametric methods.
115
Under what conditions is assuming normality reasonable for inference?
1) If the data-generating mechanism is approximately additive noise from many small sources (physics/measurement). 2) If you're working with averages/sums and n is large (CLT) — but watch out for heavy tails/outliers. 3) If diagnostics support it (QQ plot, residual plots) and tests aren't overly sensitive. In regression, normality of errors affects p-values/interval exactness more than unbiasedness.
116
Define H0 and H1 (or Ha). What does it mean to 'reject H0'?
H0: baseline/default claim (often 'no effect', 'no difference'). H1: competing claim (effect/difference exists, or direction specified). Rejecting H0 means the data are sufficiently inconsistent with H0 under the chosen test at significance α. It does NOT prove H1; it indicates evidence against H0.
117
Define Type I error, Type II error, α, β, and power.
Type I: reject H0 when H0 is true. Probability = α (significance level). Type II: fail to reject H0 when H1 is true. Probability = β. Power = 1−β: probability of detecting a true effect of a specified size.
118
What is a p-value? What is a common misinterpretation?
p-value = P(test statistic at least as extreme as observed | H0 is true). Misinterpretation: 'probability H0 is true'. That is not what a p-value is. Small p-value indicates data would be unlikely under H0 (given model assumptions).
119
Give the standard workflow for a hypothesis test.
1) Specify H0 and H1. 2) Choose a test statistic T and its distribution under H0 (exact or asymptotic). 3) Choose α (or compute p-value). 4) Compute T from data. 5) Reject H0 if p≤α (or if T in rejection region). 6) Report effect size + CI, assumptions, and practical significance.
120
When do you use one-sided vs two-sided hypotheses?
Two-sided: H1 allows deviations in either direction (μ≠μ0). Default when direction isn't pre-specified. One-sided: H1 specifies direction (μ>μ0 or μ<μ0). Use only if direction is scientifically justified *before* seeing data. One-sided tests have more power in that direction but ignore the other.
121
What happens to the Type I error rate when you run many hypothesis tests?
If you test many hypotheses, the chance of at least one false positive increases. Family-wise error rate (FWER) can be controlled by Bonferroni or Holm. False discovery rate (FDR) can be controlled by Benjamini–Hochberg. In ML research, multiple metrics or hyperparameter sweeps can inflate false discoveries.
122
Set up the one-sample z-test for a mean with known σ.
H0: μ=μ0. Test statistic: z=(x̄−μ0)/(σ/√n). Under H0 (normal or CLT), z ~ N(0,1). Reject for |z|>z_{α/2} (two-sided) or appropriate one-sided cutoff.
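A minimal sketch (x̄, σ, μ0, and n below are made-up numbers):

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical: n=36 measurements, x̄=10.4, testing H0: μ=10 with known σ=1.2
xbar, mu0, sigma, n = 10.4, 10.0, 1.2, 36

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_two_sided = 2 * (1 - normal_cdf(abs(z)))   # two-sided p-value

print(round(z, 2), round(p_two_sided, 4))  # z = 2.0, p ≈ 0.0455
```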
123
Set up the one-sample t-test for a mean with unknown σ.
H0: μ=μ0. Statistic: t=(x̄−μ0)/(s/√n). If population normal: t ~ t_{n−1} under H0. Otherwise, approximate for moderate/large n.
124
Describe the two-sample t-test for difference of means (independent samples).
Goal: test H0: μ1−μ2 = Δ0 (often 0). Use statistic based on (x̄1−x̄2−Δ0)/SE. If variances assumed equal: pooled t-test. If variances not assumed equal: Welch's t-test (default in practice).
125
Why is Welch's t-test often preferred over pooled t-test?
Welch does not assume equal variances; it uses SE = √(s1²/n1 + s2²/n2) and an approximate df (Welch–Satterthwaite). If variances are equal, Welch performs nearly as well; if not, it protects Type I error better. So it's a safer default unless you have strong evidence of equal variances.
126
When should you use a paired t-test instead of a two-sample test?
Use paired t-test when observations are naturally matched (before/after, twin studies, same subject under two conditions). Compute differences d_i = x_i − y_i and run a one-sample t-test on d_i. Pairing removes between-subject variability, increasing power.
127
Write the Welch CI for μ1−μ2.
CI: (x̄1−x̄2) ± t_{α/2, df} · √(s1²/n1 + s2²/n2), with df given by Welch–Satterthwaite approximation. Use when samples independent and variances may differ.
128
Write the CI for the mean paired difference.
Let d_i = x_i − y_i, with mean d̄ and SD s_d, n pairs. CI for μ_d: d̄ ± t_{α/2, n−1} · (s_d/√n). Then infer μ1−μ2 via μ_d.
129
Set up the one-sample z-test for a proportion p.
H0: p=p0. With X~Binomial(n,p0): z = (p̂−p0)/√(p0(1−p0)/n) ≈ N(0,1) for large n. Use if np0 and n(1−p0) are large; otherwise consider exact binomial test.
130
Set up the two-proportion z-test (A/B test) for p1=p2.
H0: p1=p2. Use pooled estimate p̂ = (x1+x2)/(n1+n2). z = (p̂1−p̂2)/√( p̂(1−p̂)(1/n1+1/n2) ). Large-sample approx; report CI for p1−p2 as well.
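A sketch with hypothetical conversion counts:

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical A/B test: 120/1000 conversions vs 150/1000
x1, n1, x2, n2 = 120, 1000, 150, 1000
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)   # pooled estimate under H0: p1 = p2

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided

print(round(z, 3), round(p_value, 4))
```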
131
Describe the chi-square GOF test and its test statistic.
Tests whether observed counts match expected counts under a model. Statistic: χ² = Σ_i (O_i − E_i)² / E_i over categories i. Under H0 (large sample): χ² ~ χ²_{df}, where df = (#categories − 1 − #estimated parameters). Rule: expected counts E_i should be sufficiently large (often ≥5).
132
How do you test independence in an r×c contingency table?
H0: row and column variables are independent. Expected count in cell (i,j): E_{ij} = (row_i total)(col_j total)/grand total. Statistic: χ² = Σ_{i,j} (O_{ij}−E_{ij})²/E_{ij}. df=(r−1)(c−1). Large-sample approximation; use Fisher's exact for small counts (2×2).
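A quick sketch using scipy's built-in independence test on a hypothetical 2×3 table of counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts for a 2x3 contingency table (hypothetical data)
table = np.array([[30, 20, 10],
                  [20, 30, 40]])

# Returns the chi-square statistic, p-value, df, and expected counts E_ij
chi2, p, dof, expected = chi2_contingency(table)
# df = (r-1)(c-1) = (2-1)(3-1) = 2
```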
133
When is Fisher’s exact test preferred to chi-square in 2×2 tables?
When sample sizes are small or expected counts are low (common rule: any expected count <5). Fisher’s exact computes exact p-value under fixed margins using the hypergeometric distribution. It controls Type I error without large-sample approximations.
134
Set up an F-test for equality of two variances (normal samples).
H0: σ1²=σ2². Statistic: F = s1²/s2² (often put the larger variance on top). Under H0 with normality and independence: F ~ F_{n1−1, n2−1}. Sensitive to non-normality; Levene/Brown–Forsythe are more robust.
135
What problem does one-way ANOVA solve? What is the null hypothesis?
Compares means across k ≥ 2 groups (most useful for 3+, where running all pairwise t-tests would inflate Type I error). H0: μ1=μ2=...=μk (all group means equal). Uses an F statistic comparing between-group variance to within-group variance.
136
Write the ANOVA F statistic in terms of mean squares.
F = MS_between / MS_within. MS_between = SS_between/(k−1), MS_within = SS_within/(N−k). Under H0 with normal errors: F ~ F_{k−1, N−k}.
137
What is a permutation test and when is it useful?
A permutation test evaluates a null hypothesis by comparing the observed test statistic to its distribution under label shuffling. Useful when parametric assumptions (normality, equal variance) are questionable. Common in ML to test if model A truly outperforms model B using paired predictions or metrics.
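A minimal permutation test for a difference in means, assuming exchangeability under H0 (data and group shift are hypothetical):

```python
import numpy as np

def perm_test_mean_diff(x, y, n_perm=5000, seed=0):
    """Two-sided permutation p-value for H0: no difference in means."""
    rng = np.random.default_rng(seed)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                     # shuffle group labels
        diff = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)           # add-one keeps p-value valid

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(2.0, 1.0, 40)   # clearly shifted group
p = perm_test_mean_diff(x, y)
```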
138
What is the bootstrap and what does it estimate?
Bootstrap resamples the observed dataset with replacement to approximate the sampling distribution of an estimator. Used to estimate standard errors, bias, and confidence intervals (percentile, BCa, etc.). Works well when i.i.d. assumption is reasonable; adapt for time series (block bootstrap).
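A percentile-bootstrap sketch for a statistic without a simple SE formula (the median); assumes i.i.d. data, here a hypothetical skewed sample:

```python
import numpy as np

def bootstrap_ci_median(x, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median (assumes i.i.d. data)."""
    rng = np.random.default_rng(seed)
    medians = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                        for _ in range(n_boot)])
    return np.quantile(medians, [alpha / 2, 1 - alpha / 2])

x = np.random.default_rng(7).exponential(scale=2.0, size=200)  # skewed data
lo, hi = bootstrap_ci_median(x)
```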
139
How do you choose between t-tests, bootstrap, and permutation tests?
t-tests: efficient when assumptions roughly hold (approx normality of mean differences). Bootstrap: good for SE/CI of complex statistics (medians, AUC) when i.i.d. Permutation: strong for hypothesis tests of 'no difference' under exchangeability; fewer distributional assumptions. In ML, paired designs + permutation are often safer for metric differences.
140
If X~Binomial(n,p), what are E[p̂] and Var(p̂)? When is p̂ approx normal?
p̂ = X/n. E[p̂]=p. Var(p̂)=p(1−p)/n. By CLT/normal approximation, p̂ ≈ Normal(p, p(1−p)/n) when np and n(1−p) are large.
141
Define joint, marginal, and conditional distributions for (X,Y).
Joint: f_{X,Y}(x,y) (pmf or pdf). Marginal: f_X(x)=∫ f_{X,Y}(x,y) dy (or Σ_y for discrete). Conditional: f_{X|Y}(x|y)= f_{X,Y}(x,y)/f_Y(y) when f_Y(y)>0. Same ideas extend to higher dimensions.
142
How do you check if X and Y are independent using their joint distribution?
Compute marginals f_X and f_Y. If for all (x,y): f_{X,Y}(x,y) = f_X(x) f_Y(y), then X ⟂ Y. Equivalently, conditional equals marginal: f_{X|Y}(x|y)=f_X(x).
143
What is a multivariate normal (MVN) distribution and a key independence property?
X ∈ ℝ^d is MVN if every linear combination aᵀX is univariate normal. Notation: X ~ N(μ, Σ). Key: in MVN, uncorrelated components (Cov=0) are independent. More generally, block-diagonal Σ ⇒ block independence.
144
Give the conditional distribution of a partitioned MVN.
If [X;Y] ~ N([μ_X;μ_Y], [[Σ_XX, Σ_XY],[Σ_YX, Σ_YY]]), then X|Y=y ~ N( μ_X + Σ_XY Σ_YY^{-1}(y−μ_Y), Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX ). Used in Gaussian conditioning / Kalman filters / GP regression.
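The conditioning formula above translates directly to numpy; a sketch with a 2-D example where the answer is known in closed form:

```python
import numpy as np

def mvn_condition(mu, Sigma, idx_x, idx_y, y):
    """Mean and covariance of X | Y=y for a jointly Gaussian vector."""
    mu = np.asarray(mu, float)
    S = np.asarray(Sigma, float)
    mx, my = mu[idx_x], mu[idx_y]
    Sxx = S[np.ix_(idx_x, idx_x)]
    Sxy = S[np.ix_(idx_x, idx_y)]
    Syy = S[np.ix_(idx_y, idx_y)]
    K = Sxy @ np.linalg.inv(Syy)                  # "gain" Σ_XY Σ_YY^{-1}
    cond_mean = mx + K @ (np.asarray(y) - my)
    cond_cov = Sxx - K @ Sxy.T                    # Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX
    return cond_mean, cond_cov

# Bivariate normal with correlation 0.8:
# E[X|Y=1] = 0.8, Var(X|Y=1) = 1 - 0.8^2 = 0.36
m, C = mvn_condition([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], [0], [1], [1.0])
```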
145
How do you find the PDF of Y=g(X) for a continuous invertible transform?
If y=g(x) is 1–1 with inverse x=g^{-1}(y), then f_Y(y) = f_X(g^{-1}(y)) · |d/dy g^{-1}(y)|. In multivariate case, multiply by |det(Jacobian of inverse)|.
146
How do frequentists and Bayesians compute predictive probabilities for new data?
Frequentist: plug-in estimate θ̂ and compute p(x_new|θ̂); uncertainty handled via sampling distributions/intervals. Bayesian: integrate over posterior uncertainty: p(x_new|D)=∫ p(x_new|θ)p(θ|D)dθ. Bayesian predictive usually yields wider (more honest) uncertainty when data are limited.
147
What is E[X|Y] and how should you interpret it?
E[X|Y] is a random variable (a function of Y) giving the best mean-squared-error predictor of X given Y. Intuition: average of X among cases where Y is fixed. Key property: E[X|Y]=g(Y) for some function g.
148
State the tower property for conditional expectation.
E[X] = E[ E[X|Y] ]. More generally: E[ E[X|Y,Z] | Z ] = E[X|Z]. Think: averaging in stages gives the same overall average.
149
State the law of total variance and interpret it.
Var(X) = E[ Var(X|Y) ] + Var( E[X|Y] ). Interpretation: total variability = average within-group variability + variability of group means.
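A simulation sketch of the decomposition with a two-group mixture (group means/SDs are hypothetical; with equal group probabilities, total variance = 2.5 + 6.25 = 8.75):

```python
import numpy as np

rng = np.random.default_rng(3)
# Y picks a group with prob 1/2 each; X|Y=j ~ Normal(means[j], sds[j]^2)
means = np.array([0.0, 5.0])
sds = np.array([1.0, 2.0])
y = rng.integers(0, 2, size=200_000)
x = rng.normal(means[y], sds[y])

total_var = x.var()
# E[Var(X|Y)]: average within-group variance (groups are ~equal-sized)
within = np.mean([x[y == j].var() for j in (0, 1)])
# Var(E[X|Y]): variability of the group means
between = np.var([x[y == j].mean() for j in (0, 1)])
```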
150
State Markov's inequality and when it applies.
If X≥0 and a>0, then P(X ≥ a) ≤ E[X]/a. Useful for bounding tail probabilities with only the mean known.
151
State Chebyshev's inequality and interpret it.
For any RV with mean μ and variance σ²: P(|X−μ| ≥ kσ) ≤ 1/k². Gives a distribution-free concentration bound; often loose but very general.
152
State Jensen's inequality and why it's important.
If φ is convex, then φ(E[X]) ≤ E[φ(X)]. Used throughout statistics/ML (e.g., deriving ELBOs, proving log-likelihood concavity).
153
Write the chain rule for P(A1∩...∩An).
P(A1∩...∩An) = P(A1) P(A2|A1) P(A3|A1∩A2) ... P(An|A1∩...∩A_{n−1}). For random variables: f(x1,...,xn)=f(x1)f(x2|x1)...f(xn|x1..x_{n−1}).
154
Define conditional independence X ⟂ Y | Z.
X and Y are conditionally independent given Z if f_{X,Y|Z}(x,y|z)=f_{X|Z}(x|z) f_{Y|Z}(y|z) for all z. Equivalently: P(X∈A, Y∈B | Z)=P(X∈A|Z)P(Y∈B|Z).
155
Given data (xi,yi), how can you *assess* whether X and Y are independent?
1) Plot: scatter/heatmap; check for nonlinear patterns. 2) Correlation: tests only linear dependence (Pearson). Use Spearman/Kendall for monotonic. 3) Categorical: chi-square test of independence. 4) Continuous: mutual information estimates, HSIC, distance correlation; or compare f_{X,Y} to f_X f_Y via density estimates. 5) Model-based: fit conditional model Y|X and test if X adds predictive power. No finite-sample method can prove independence; you gather evidence under assumptions.
156
Define Shannon entropy for a discrete RV and interpret it.
H(X)= −Σ_x p(x) log p(x). Measures uncertainty / average information. Maximized by uniform distribution over a finite support.
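A tiny entropy sketch in base 2 (bits), using the convention 0·log 0 = 0:

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

h_uniform = entropy([0.25] * 4)                 # uniform over 4 outcomes: 2 bits
h_skewed = entropy([0.97, 0.01, 0.01, 0.01])    # nearly deterministic: far less
```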
157
What is differential entropy and how does it differ from discrete entropy?
h(X)= −∫ f(x) log f(x) dx. Unlike discrete entropy, h(X) can be negative and is not invariant to reparameterization. Mutual information (difference of entropies) remains meaningful.
158
How is χ² distribution related to Gamma?
If Z ~ χ²_k, then Z ~ Gamma(shape= k/2, rate=1/2). Equivalently Gamma(shape=k/2, scale=2). This relationship helps compute moments and connect to exponential family.
159
How does Student's t arise from a Normal and a Chi-square RV?
If Z~N(0,1) and U~χ²_ν independent, then T = Z / √(U/ν) ~ t_ν. In sampling: replacing σ by s introduces the χ² term, leading to t.
160
Express an F distribution as a ratio of chi-square variables.
If U~χ²_{d1} and V~χ²_{d2} independent, then F = (U/d1)/(V/d2) ~ F_{d1,d2}.
161
Define a (homogeneous) Poisson process with rate λ.
A counting process N(t) with: 1) N(0)=0. 2) Independent increments. 3) Stationary increments: N(t+s)−N(s) ~ Poisson(λt). 4) For small h: P(1 event in h)≈λh and P(≥2 in h)=o(h). Then N(t) ~ Poisson(λt).
162
What is the distribution of interarrival times in a Poisson process?
Interarrival times are i.i.d. Exponential(λ). Waiting time to the k-th event is Gamma(k, rate=λ). These facts connect Poisson counts to exponential/gamma waiting times.
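A simulation sketch of this connection: summing Exponential(λ) gaps gives arrival times, and the count up to time T behaves like Poisson(λT) (rate and horizon are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, T = 2.0, 1000.0

# Interarrival gaps ~ Exponential(lam); arrival times are their cumulative sums
gaps = rng.exponential(scale=1 / lam, size=int(3 * lam * T))
arrivals = np.cumsum(gaps)

# N(T): number of arrivals before time T; should be ~ Poisson(lam * T) = Poisson(2000)
n_T = int(np.searchsorted(arrivals, T))
```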
163
Compare mean and median. When is median preferred?
Mean uses all values and is sensitive to outliers. Median is the 50th percentile and is robust to extreme values. Median preferred for skewed/heavy-tailed distributions (income, response times) or when outliers are common.
164
Why does the unbiased sample variance use (n−1) instead of n?
s² = (1/(n−1)) Σ (x_i−x̄)² is unbiased for σ² when data are i.i.d. with finite variance. Using x̄ (estimated from data) 'consumes' one degree of freedom: the residuals x_i−x̄ satisfy one linear constraint (they sum to 0), so only n−1 are free. Dividing by n would underestimate σ² on average.
165
What is a z-score and why standardize?
z = (x−mean)/SD. Standardization puts variables on a common scale (mean 0, SD 1), aiding comparison, numerical stability, and interpretation (especially in ML).
166
List common pitfalls when interpreting correlation.
Correlation ≠ causation. Confounding variables can induce correlation. Nonlinear relationships may have near-zero correlation. Outliers can dominate correlation. Restriction of range can reduce correlation. Simpson’s paradox: aggregated data can reverse trends seen in subgroups.
167
What does the Mann–Whitney U test compare? When use it?
Compares two independent samples without assuming normality. Tests whether one distribution tends to produce larger values than the other (often interpreted as a shift in location). Use for skewed data, ordinal outcomes, or when outliers/non-normality make t-test unreliable.
168
What does the Wilcoxon signed-rank test do? When use it?
Nonparametric alternative to paired t-test. Tests whether the median of paired differences is 0 (with symmetry assumptions). Useful when paired differences are not approximately normal.
169
What does the KS test test, and what is its main use-case?
Compares a sample CDF to a reference CDF (one-sample) or compares two sample CDFs (two-sample). Test statistic: max absolute CDF difference. Sensitive to differences in distribution shape; works for continuous distributions (ties complicate).
170
What is an A/A test and why run it in experimentation?
An A/A test splits traffic into two identical variants. Used to validate instrumentation, randomization, and false positive rates. If A/A shows systematic differences or too many 'significant' results, your experiment pipeline may be biased.
171
What inputs do you need to compute power or required sample size?
1) Significance level α. 2) Desired power 1−β. 3) Effect size of practical interest (difference in means/proportions). 4) Variability (σ) or baseline rate p. 5) Test type (two-sided vs one-sided) and design (paired vs independent). Then solve for n (often numerically).
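The ingredients above combine into the classic approximate formula for a two-sample z-test of means, n per group = 2((z_{α/2}+z_β)σ/Δ)²; a sketch (effect size and σ are hypothetical):

```python
import math
from scipy.stats import norm

def n_per_group(delta, sigma, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample z-test of means."""
    z_a = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_b = norm.ppf(power)           # quantile for desired power
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Detect a half-SD difference (delta=0.5, sigma=1) at alpha=0.05, power=0.8
n = n_per_group(delta=0.5, sigma=1.0)
```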
172
Define Cohen's d for difference in means and why report it.
d = (x̄1−x̄2)/s_pooled (or use s from a relevant reference). It’s a standardized effect size, making results comparable across scales. Always interpret in context; 'small/medium/large' rules are domain dependent.
173
Given a research question about means, how do you choose between z, t, Welch, paired t, or nonparametric?
1) One sample vs two samples vs paired. 2) If σ known and normal/large n: one-sample z. 3) If σ unknown: use t. 4) Two independent groups: Welch t (default). Pooled t only if equal variances justified. 5) Matched pairs: paired t on differences. 6) Strong non-normality/outliers/small n: consider rank tests or permutation/bootstrap. Always report effect size + CI and check assumptions (independence, measurement).
174
How do you choose between binomial test, z test for proportions, chi-square, and Fisher's exact?
1) Single proportion vs two proportions vs multi-category. 2) Small n or extreme p: exact binomial test. 3) Large n: z tests/CI for one or two proportions. 4) Multi-category GOF: chi-square GOF. 5) Contingency independence: chi-square independence; if low expected counts (esp 2×2), use Fisher exact. Assumptions: independent observations, correct expected counts model.
175
Given multinomial counts x1..xk with total n, what are the MLE and Dirichlet MAP for p?
MLE: p̂_i = x_i / n. With Dirichlet prior p~Dir(α1..αk), posterior Dir(α_i + x_i). MAP (all α_i>1): p_MAP,i = (α_i + x_i − 1) / (Σ_j α_j + n − k).
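A direct sketch of both estimators (counts and prior are hypothetical); note how the MAP keeps the zero-count category away from 0:

```python
import numpy as np

x = np.array([3, 7, 0])              # observed counts, n = 10
alpha = np.array([2.0, 2.0, 2.0])    # Dirichlet prior, all alpha_i > 1 so the MAP formula applies
n, k = x.sum(), x.size

p_mle = x / n
p_map = (alpha + x - 1) / (alpha.sum() + n - k)   # = [4/13, 8/13, 1/13]
```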
176
If x_i ~ Normal(μ, σ²) with known σ² and prior μ~Normal(μ0, τ²), what is posterior for μ?
Posterior μ|x is Normal with: Precision additivity: 1/τ_post² = 1/τ² + n/σ². Mean: μ_post = τ_post²( μ0/τ² + n x̄ / σ² ). MAP = posterior mean (Normal is symmetric/unimodal).
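A sketch of the precision-additivity update (data and prior values are hypothetical; with a weak prior the posterior mean stays near x̄):

```python
import numpy as np

def normal_posterior(x, sigma2, mu0, tau2):
    """Posterior for mu given Normal data with known sigma^2 and Normal(mu0, tau2) prior."""
    n = len(x)
    prec_post = 1 / tau2 + n / sigma2               # precisions add
    var_post = 1 / prec_post
    mean_post = var_post * (mu0 / tau2 + n * np.mean(x) / sigma2)
    return mean_post, var_post

x = [4.8, 5.1, 5.3, 4.9]
m, v = normal_posterior(x, sigma2=1.0, mu0=0.0, tau2=100.0)  # weak prior
```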
177
For i.i.d. Uniform(0, θ), what is the MLE of θ? Is it unbiased?
Likelihood is constant for θ ≥ max(x_i) and 0 otherwise. MLE: θ̂ = max(x_i). It is biased downward: E[max] = n/(n+1) θ. Unbiased estimator: (n+1)/n · max(x_i).
178
Do Gamma(α,β) MLEs have closed-form? What is typically done?
If both α and β unknown, MLEs usually have no closed-form for α. Common approach: set β̂ = α̂/x̄ (rate parameterization), then solve for α̂ numerically from the digamma equation log α̂ − ψ(α̂) = log x̄ − (1/n)Σ log x_i. In practice: use numerical optimization or method of moments as initialization.
179
If p~Beta(α,β) and you observe S successes, F failures, what is predictive P(next trial is success)?
Posterior is Beta(α+S, β+F). Predictive success probability for next Bernoulli is the posterior mean: P(X_new=1|data)= (α+S)/(α+β+S+F). This is 'add-α, add-β' smoothing (Laplace smoothing when α=β=1).
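The predictive rule is one line of code; with α=β=1 it is exactly Laplace's rule of succession:

```python
def beta_predictive(S, F, alpha=1.0, beta=1.0):
    """P(next trial = success | S successes, F failures) under a Beta(alpha, beta) prior."""
    return (alpha + S) / (alpha + beta + S + F)

# Laplace smoothing: 0 successes in 3 trials still gives a nonzero estimate
p = beta_predictive(S=0, F=3)   # (1+0)/(1+1+0+3) = 1/5
```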
180
With prior Dirichlet(α1..αk) and counts x1..xk, what is predictive probability of next category i?
Posterior Dir(α_i + x_i). Predictive: P(next=i | data)= (α_i + x_i) / (Σ_j α_j + n). This yields additive smoothing; prevents zero probabilities when α_i>0.
181
What is the difference between likelihood L(θ) and probability P(data|θ)?
P(data|θ) is a probability as a function of data for fixed θ. Likelihood L(θ)=P(data|θ) viewed as a function of θ for fixed observed data. Likelihood values don't have to sum/integrate to 1 over θ; only relative values matter for MLE/MAP.
182
How are two-sided hypothesis tests linked to confidence intervals?
For many parameters, a two-sided α-level test rejects H0:θ=θ0 iff θ0 is outside the (1−α) confidence interval. Example: reject μ=μ0 at α=0.05 iff μ0 not in the 95% CI for μ.
183
List the most common assumptions behind classical z/t/χ²/F tests.
1) Random sampling / independence (often the most important). 2) Correct model for sampling distribution (normality for exact t/F, large-sample approximations for z/χ²). 3) For pooled t / ANOVA: equal variances (homoscedasticity). 4) For χ²: expected cell counts not too small. Violations can inflate Type I error or reduce power.
184
Why do residual diagnostics matter in regression? What do you check?
Classical inference relies on error assumptions (linearity, constant variance, independence, sometimes normality). Check: residual vs fitted (nonlinearity/heteroscedasticity), QQ plot (normality), leverage/influence (outliers), autocorrelation (time series). In ML, diagnostics help decide transformations, robust methods, or different models.
185
What is heteroscedasticity and how do robust standard errors help?
Heteroscedasticity: Var(ε|X) not constant. OLS coefficients remain unbiased (under exogeneity) but usual SEs and t/F inference can be wrong. Robust (sandwich) SEs estimate variance without assuming constant error variance.
186
If Y = exp(X) and X~Normal(μ,σ²), what is Y? Give mean and median.
Y is Lognormal(μ,σ²). Median: exp(μ). Mean: exp(μ + σ²/2). Used for positive quantities formed by multiplicative effects.
187
How do indicator variables help compute expectations?
Let I_A be 1 if event A occurs else 0. Then E[I_A]=P(A). For counts, write the count as sum of indicators and use linearity: E[Σ I_i]=Σ E[I_i]. Great for expected # of successes, collisions, etc., even with dependence.
188
What is the birthday problem and what principle does it illustrate?
Question: among n people, what's P(at least two share a birthday)? Illustrates complement rule and dependence between pair events. Compute P(no match) = 365/365 × 364/365 × ... × (365−n+1)/365; then 1−that.
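The product above is easy to compute directly; for n = 23 the match probability famously crosses 1/2:

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birthday), via the complement rule."""
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days   # i-th person avoids all earlier birthdays
    return 1 - p_no_match

p23 = p_shared_birthday(23)   # just over 0.5
```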
189
What does the Monty Hall problem teach about conditional probability?
Host's action depends on your initial choice (not independent). Condition on the host revealing a goat: switching wins with probability 2/3. Lesson: update probabilities using conditional information and the procedure generating the evidence.
190
What is Simpson’s paradox and why does it matter in statistics/ML?
A trend appearing in several groups can reverse when groups are combined. Caused by confounding / different group sizes. Lesson: stratify by important variables; be cautious interpreting aggregated correlations or success rates.
191
When can posterior mean and MAP differ a lot?
If posterior is skewed or multimodal. MAP picks the highest-density point (mode) and can sit on a boundary. Posterior mean averages over uncertainty and can lie in low-density regions for multimodal posteriors. Choose estimator based on loss function (mean for squared error, median for absolute error, MAP for 0–1 loss on a grid).