Bayesian Basics for ML Flashcards

(40 cards)

1
Q

What is the core idea of Bayesian inference?

A

To treat unknown parameters as random variables and update a prior belief about them using observed data to obtain a posterior belief.

2
Q

In Bayesian terms, what is a prior distribution?

A

A probability distribution that represents our beliefs about a parameter before seeing the current data.

3
Q

What is a likelihood in Bayesian inference?

A

The probability of the observed data as a function of the parameter, reflecting how plausible the data are under different parameter values.

4
Q

What is a posterior distribution?

A

The updated distribution of the parameter after combining the prior and the likelihood using Bayes’ rule.

5
Q

What is Bayes’ rule in the parameter-data form?

A

Posterior ∝ Likelihood × Prior, or p(θ|data) ∝ p(data|θ)p(θ).

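A quick numerical sketch of this rule, approximating the posterior for a coin's bias on a grid; the flat prior and the 7-heads-in-10-flips data are invented for illustration.

```python
# Discretize the coin's bias theta and apply Posterior ∝ Likelihood × Prior.
thetas = [i / 100 for i in range(1, 100)]    # grid over (0, 1)
prior = [1.0 for _ in thetas]                # flat prior (unnormalized)

heads, flips = 7, 10
likelihood = [t**heads * (1 - t)**(flips - heads) for t in thetas]

unnorm = [lik * p for lik, p in zip(likelihood, prior)]
z = sum(unnorm)                              # normalization constant
posterior = [u / z for u in unnorm]          # now sums to 1

post_mean = sum(t * p for t, p in zip(thetas, posterior))
```

With a flat prior this grid posterior approximates a Beta(8, 4) shape, so the posterior mean lands near 2/3.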
6
Q

What does the normalization constant in Bayes’ rule ensure?

A

That the posterior distribution integrates (or sums) to 1 over all possible parameter values.

7
Q

What is a point estimate in the Bayesian framework?

A

A single summary of the posterior such as the posterior mean, median, or maximum a posteriori (MAP) estimate.

8
Q

What is the MAP (maximum a posteriori) estimate?

A

The parameter value that maximizes the posterior distribution p(θ|data).

9
Q

How is MAP estimation related to regularized maximum likelihood?

A

MAP is equivalent to maximizing log-likelihood plus log-prior, which often looks like a regularized loss function.

10
Q

What type of prior corresponds to L2 regularization in linear models?

A

A Normal (Gaussian) prior on coefficients, leading to ridge-like penalties.

11
Q

What type of prior corresponds to L1 regularization in linear models?

A

A Laplace (double-exponential) prior on coefficients, leading to lasso-like penalties.

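A small check of the MAP-as-penalized-likelihood connection from the last few cards, assuming Gaussian observations with a Normal prior on the mean; the data values and prior scale are invented, and a Laplace prior would swap the squared penalty for an absolute-value (L1) one.

```python
# Model: y_i ~ N(theta, 1), prior: theta ~ N(0, tau2).
y = [1.8, 2.2, 1.5, 2.4, 2.1]
tau2 = 1.0

def neg_log_posterior(theta):
    nll = sum(0.5 * (yi - theta) ** 2 for yi in y)   # Gaussian neg log-lik (up to const)
    penalty = theta ** 2 / (2 * tau2)                # -log Normal prior = L2 "ridge" term
    return nll + penalty

# MAP by brute-force grid search...
grid = [i / 1000 for i in range(-1000, 3001)]
map_grid = min(grid, key=neg_log_posterior)

# ...matches the closed form for this conjugate setup.
map_closed = sum(y) / (len(y) + 1 / tau2)
```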
12
Q

Why is it useful to view regularization as a Bayesian prior?

A

It provides an interpretation of regularization as encoding prior beliefs about parameter magnitude and supports probabilistic reasoning.

13
Q

What is a conjugate prior?

A

A prior distribution chosen so that the posterior is in the same family as the prior when combined with a given likelihood.

14
Q

Why are conjugate priors convenient?

A

They allow closed-form posterior updates, simplifying analytic calculations and reducing computational cost.

15
Q

What is a conjugate prior for a Bernoulli or Binomial likelihood?

A

The Beta distribution is conjugate to Bernoulli and Binomial likelihoods for a probability parameter.

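The Beta–Bernoulli/Binomial update can be written as a one-line closed form; the Beta(2, 2) prior and the counts below are illustrative.

```python
def beta_binomial_update(a, b, k, n):
    """Posterior Beta parameters after observing k successes in n trials."""
    return a + k, b + (n - k)

a_post, b_post = beta_binomial_update(2, 2, 9, 12)   # conjugate: still a Beta
posterior_mean = a_post / (a_post + b_post)
```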
16
Q

What is a conjugate prior for a Poisson likelihood?

A

The Gamma distribution is conjugate for the rate parameter of a Poisson likelihood.

17
Q

What is a conjugate prior for a Normal likelihood with known variance and unknown mean?

A

A Normal prior on the mean is conjugate, yielding a Normal posterior for the mean.
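A sketch of this closed-form Normal–Normal update, assuming the noise variance is known; all numbers are illustrative.

```python
def normal_update(mu0, tau0_sq, sigma2, y):
    """Posterior mean and variance for an unknown Normal mean, known variance."""
    n = len(y)
    prec = 1 / tau0_sq + n / sigma2            # precisions (inverse variances) add
    post_var = 1 / prec
    post_mean = post_var * (mu0 / tau0_sq + sum(y) / sigma2)
    return post_mean, post_var

# Prior N(0, 4), noise variance 1, three observations.
post_mean, post_var = normal_update(0.0, 4.0, 1.0, [1.2, 0.8, 1.0])
```

Note that the posterior variance is always smaller than the prior variance: data can only sharpen the belief here.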

18
Q

What is a Bayesian credible interval?

A

An interval [a,b] such that the posterior probability that the parameter lies in [a,b] equals a chosen level (e.g., 95%).
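One way to sketch a 95% equal-tailed credible interval for a Beta(8, 4) posterior: integrate the density on a grid and read off the quantiles (scipy.stats.beta.ppf would give them directly; grid integration keeps this stdlib-only).

```python
import math

a, b = 8.0, 4.0
log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)  # log 1/B(a,b)

step = 1e-4
xs = [i * step for i in range(1, int(1 / step))]
pdf = [math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))
       for x in xs]

cdf, total = [], 0.0
for p in pdf:                      # crude numerical CDF
    total += p * step
    cdf.append(total)

lo = next(x for x, c in zip(xs, cdf) if c >= 0.025)   # 2.5% quantile
hi = next(x for x, c in zip(xs, cdf) if c >= 0.975)   # 97.5% quantile
```

By construction, the posterior probability that the parameter lies in [lo, hi] is about 0.95.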

19
Q

How does a credible interval differ conceptually from a frequentist confidence interval?

A

A credible interval directly expresses probability about the parameter given the data; a confidence interval concerns the long-run coverage of the procedure under repeated sampling.

20
Q

What is the posterior predictive distribution?

A

The distribution of a future observation, obtained by averaging the likelihood of new data over the posterior distribution of the parameters.

21
Q

Why is the posterior predictive distribution useful in ML?

A

It captures both parameter uncertainty and data noise, providing more realistic uncertainty about future predictions.
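For the Beta–Bernoulli case, averaging the likelihood of a new observation over the posterior collapses to the posterior mean; the uniform prior and counts below are illustrative.

```python
a, b = 1.0, 1.0          # uniform Beta(1, 1) prior
k, n = 7, 10             # observed successes / trials

# P(next trial succeeds | data) = integral of theta over the Beta posterior,
# which is just the posterior mean (a + k) / (a + b + n).
p_next_success = (a + k) / (a + b + n)
```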

22
Q

What is hierarchical (multilevel) modeling in the Bayesian context?

A

Modeling parameters themselves as drawn from higher-level distributions, allowing sharing of information across groups or entities.

23
Q

Why are hierarchical models powerful?

A

They enable partial pooling across groups, improving estimates for small-sample groups and capturing structure in multi-level data.

24
Q

What is partial pooling?

A

An approach where group-specific estimates are shrunk towards a global mean based on data and prior, balancing between no pooling and complete pooling.
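A minimal partial-pooling sketch, assuming the within-group noise variance and between-group spread are known (in a real hierarchical model both would be inferred); all numbers are invented.

```python
sigma2, tau2 = 1.0, 0.5        # within-group noise, between-group spread
global_mean = 10.0
groups = [("small", 11.5, 2),  # (name, group sample mean, group size n)
          ("large", 11.5, 200)]

shrunk = {}
for name, ybar, n in groups:
    w = tau2 / (tau2 + sigma2 / n)            # weight on the group's own data
    shrunk[name] = w * ybar + (1 - w) * global_mean
```

Both groups report the same sample mean, but the small group is shrunk much further toward the global mean, which is exactly the partial-pooling behavior.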

25
Q

What is a posterior mean estimator for a probability parameter under a Beta–Binomial model?

A

Often a weighted average of the prior mean and observed proportion, with weights depending on prior strength and sample size.

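A numerical check of this weighted-average decomposition, with an invented Beta(4, 4) prior and data.

```python
a, b = 4.0, 4.0          # prior pseudo-counts
k, n = 9, 12             # observed successes / trials

prior_mean = a / (a + b)
sample_prop = k / n
w = (a + b) / (a + b + n)                 # weight on the prior grows with a + b

post_mean_direct = (a + k) / (a + b + n)
post_mean_weighted = w * prior_mean + (1 - w) * sample_prop
```

The two expressions agree, and as n grows the weight w shrinks, so the data increasingly dominate the prior.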
26
Q

Why can Bayesian updating be seen as a form of 'smart smoothing' of empirical estimates?

A

Priors temper extreme values from small samples, pulling estimates towards plausible ranges until more data reduces prior influence.

27
Q

What role do priors play when we have lots of data?

A

Under regular conditions, the likelihood dominates and the influence of reasonable priors diminishes as sample size grows.

28
Q

Why can prior choice matter a lot in small-sample settings?

A

With limited data, the prior contributes substantially to the posterior, affecting estimates and uncertainty.

29
Q

What is a noninformative or weakly informative prior?

A

A prior chosen to have minimal influence or to rule out only extreme or implausible parameter values while remaining relatively diffuse.

30
Q

Why are truly 'noninformative' priors often elusive?

A

Priors that appear flat in one parameterization can be highly informative under reparameterization, and boundaries/constraints complicate neutrality.

31
Q

What is MCMC (Markov chain Monte Carlo) in Bayesian computation?

A

A class of algorithms that approximate posterior distributions by drawing samples via a Markov chain whose stationary distribution is the posterior.

32
Q

Why is MCMC needed for many Bayesian models?

A

Because closed-form posteriors are unavailable and direct integration is intractable in high dimensions.

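A toy random-walk Metropolis sampler (a minimal sketch, not production MCMC) targeting an unnormalized Beta(8, 4)-shaped posterior; the proposal scale, chain length, and burn-in are arbitrary choices.

```python
import math
import random

random.seed(0)

def log_target(theta):
    """Unnormalized log posterior: the Beta(8, 4) kernel."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return 7 * math.log(theta) + 3 * math.log(1 - theta)

theta = 0.5
samples = []
for _ in range(30000):
    proposal = theta + random.gauss(0.0, 0.1)          # symmetric random walk
    accept_prob = math.exp(min(0.0, log_target(proposal) - log_target(theta)))
    if random.random() < accept_prob:                  # Metropolis accept step
        theta = proposal
    samples.append(theta)

burned = samples[5000:]                                # discard burn-in
mcmc_mean = sum(burned) / len(burned)
```

The sample mean should settle near the true posterior mean 8/12, even though the target density was never normalized.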
33
Q

What are variational inference methods in Bayesian ML?

A

Optimization-based approximations that replace the true posterior with a simpler distribution chosen to minimize a divergence, often for scalability.

34
Q

Why is approximate Bayesian inference important in modern ML?

A

Exact inference is often infeasible in complex models, so practical algorithms rely on approximations to capture key posterior structure.

35
Q

What is Bayesian model averaging at a high level?

A

Combining predictions from multiple models weighted by their posterior probabilities instead of choosing a single best model.

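A sketch of the averaging step, assuming the posterior model probabilities have already been computed; all probabilities below are invented.

```python
# p(model | data): how plausible each model is after seeing the data.
model_posterior = {"m1": 0.7, "m2": 0.3}
# p(event | model, data): each model's predictive probability for an event.
pred_prob = {"m1": 0.9, "m2": 0.5}

# Model-averaged prediction: weight each model's forecast by its posterior.
bma_pred = sum(model_posterior[m] * pred_prob[m] for m in model_posterior)
```

The averaged prediction lands between the two models' forecasts, reflecting uncertainty about which model is right.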
36
Q

Why is Bayesian model averaging conceptually attractive?

A

It accounts for model uncertainty, often improving predictive performance and calibration compared to relying on one model.

37
Q

How does Bayesian thinking relate to ML hyperparameter tuning?

A

Bayesian optimization treats model performance as an unknown function of the hyperparameters and uses a probabilistic surrogate model to choose promising configurations to evaluate next.

38
Q

Why is a full Bayesian treatment of deep nets often impractical?

A

The parameter space is huge and highly non-linear, making exact posterior inference computationally difficult.

39
Q

What is a practical Bayesian-inspired idea used in ML without full inference?

A

Using priors as regularizers, treating ensembles as approximations to the posterior predictive distribution, or deriving uncertainty estimates from techniques like dropout.

40
Q

In one sentence, what is the key mental model for Bayesian stats in ML?

A

Bayesian methods update prior beliefs with data to produce posterior distributions, which in ML often appear as regularization, uncertainty estimates, and principled predictive distributions.