In practice, the probabilistic models that we use are often quite complex, and simple algorithms for exact inference, like ______________ _____________, may be too slow for them. In fact, many interesting classes of models may not admit ________ _____________-________ solutions at all.
For this reason, a significant amount of research effort in machine learning is devoted to developing algorithms that yield approximate solutions to the inference problem.
variable elimination; exact; polynomial; time
What are the 2 main families of approximate inference algorithms?
1) Variational methods, which formulate inference as an optimization problem.
2) Sampling methods, which produce answers by repeatedly generating random numbers from a distribution of interest.
_____________ methods have historically been the main way of performing approximate inference, although over the past 15 years, _____________ methods have emerged as viable (and often superior) alternatives.
Sampling; variational
For most probabilistic models of practical interest, _________ inference is intractable, and so we have to resort to some form of approximation. We now consider approximate inference methods based on ____________ ___________, also known as Monte Carlo techniques: a stochastic approach to solving numerical integration problems.
exact; numerical sampling
For some applications, the posterior distribution over _____________ variables will be of direct interest in itself. For most situations, the posterior distribution is required primarily for the purpose of evaluating _______________, for example, to make _____________.
unobserved; expectations; predictions
The fundamental problem that we wish to address involves finding the _____________ of some function f(z) with respect to a probability distribution p(z).
The components of z might comprise ____________ or _____________ variables or some combination of the two.
expectation; discrete; continuous
We want to compute the ___________ value of some function. We shall suppose that such expectations are too ___________ to be evaluated exactly using _____________ techniques.
expected; complex; analytical
The general idea behind sampling methods is to obtain a set of samples drawn _______________ from the distribution p(z). This allows the expectation to be approximated by a __________ ________. This is called __________ _________ integration. The accuracy of the estimator does not depend on the dimensionality of z; it depends only on the ___________ _____ ____________.
independently; finite sum; Monte Carlo; number of samples
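The finite-sum approximation above can be sketched in a few lines. This is an illustrative example, not from the source: the function f(z) = z² and the standard-normal p(z) are arbitrary choices, so E[f(z)] = Var(z) = 1.

```python
import random

def mc_expectation(f, sampler, n_samples):
    """Approximate E[f(z)] by the average of f over i.i.d. samples from p(z)."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples

random.seed(0)
# f(z) = z^2 under p(z) = N(0, 1); the true expectation is 1.0.
estimate = mc_expectation(lambda z: z * z,
                          lambda: random.gauss(0.0, 1.0),
                          100_000)
```

The estimator's accuracy is governed by the number of samples, not by the dimensionality of z, which is exactly the property the card states.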
In the case of a ___________ graph with no observed variables, it is straightforward to sample from the joint distribution (assuming that it is possible to sample from the conditional distributions at each node) using the ____________ _____________.
directed; ancestral sampling
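Ancestral sampling can be sketched on a toy directed chain z1 → z2 → z3. The linear-Gaussian conditionals here are hypothetical, chosen so the marginals are easy to check: we visit nodes in topological order and sample each from p(zi | parents(zi)).

```python
import random

def ancestral_sample():
    z1 = random.gauss(0.0, 1.0)   # root node: sample from p(z1)
    z2 = random.gauss(z1, 1.0)    # sample from p(z2 | z1)
    z3 = random.gauss(z2, 1.0)    # sample from p(z3 | z2)
    return z1, z2, z3

random.seed(1)
samples = [ancestral_sample() for _ in range(50_000)]
# Marginally z3 ~ N(0, 3), since the three unit variances add up.
```

One pass through the graph yields one exact sample from the joint distribution, with each parent instantiated before its children.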
In the case of probability distributions defined by an _____________ graph, there is no ______-_______ sampling strategy that will sample from the prior distribution with no observed variables. Instead, computationally more expensive techniques must be employed, such as ________ sampling.
undirected; one; pass; Gibbs
What is forward sampling?
To obtain a sample from the joint distribution, we make one pass through the set of variables in the order z1, …, zM, sampling from the conditional distributions p(zi | pai). At each step, all of the parent values will have been instantiated. After one pass through the graph, we will have obtained a sample from the joint distribution.
In a Bayesian network over M variables, forward sampling allows us to sample from the joint distribution x ∼ p(x) in linear time, O(M), by taking exactly one multinomial sample from each CPD.
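For a discrete Bayesian network, "one multinomial sample per CPD" looks like the following sketch over a hypothetical two-node network A → B (the CPD numbers are made up for illustration):

```python
import random

P_A = 0.3                        # p(A = 1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}   # p(B = 1 | A)

def forward_sample():
    # One Bernoulli (binary multinomial) draw per CPD, in topological order.
    a = 1 if random.random() < P_A else 0
    b = 1 if random.random() < P_B_GIVEN_A[a] else 0
    return a, b

random.seed(2)
samples = [forward_sample() for _ in range(100_000)]
# Marginalizing: p(B=1) = 0.3*0.9 + 0.7*0.2 = 0.41.
```

The cost is O(M) per joint sample: exactly one draw per variable, regardless of how the network is otherwise structured.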
What are the limitations of forward sampling?
Forward sampling generates samples following the joint distribution p(X) and assumes that no variables are fixed (no observed evidence). When we want to compute p(Y | E = e), we cannot directly enforce the evidence; we can only hope that some samples happen to match it. If the evidence is rare, almost no samples will match, making the method very inefficient.
What’s the idea behind rejection sampling?
Generate samples as in forward sampling, but discard those that don’t match the evidence. The remaining (accepted) samples are approximately drawn from p(X|E = e). Simple and intuitive, but wasteful if the evidence is unlikely.
Rejection sampling is simple and intuitive but wasteful if evidence is ____________.
unlikely
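Rejecting samples that contradict the evidence can be sketched on the same hypothetical A → B network used above (CPD numbers chosen for illustration): estimate p(A = 1 | B = 1) by forward sampling and keeping only samples with B = 1.

```python
import random

P_A = 0.3
P_B_GIVEN_A = {1: 0.9, 0: 0.2}

random.seed(3)
accepted = []
for _ in range(100_000):
    a = 1 if random.random() < P_A else 0
    b = 1 if random.random() < P_B_GIVEN_A[a] else 0
    if b == 1:                      # keep only samples matching the evidence
        accepted.append(a)

posterior = sum(accepted) / len(accepted)
# Exact answer by Bayes' rule: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) ~ 0.659.
```

Here roughly 59% of the work is thrown away; with rarer evidence, nearly every sample would be discarded, which is exactly the wastefulness the card describes.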
Explain rejection sampling.
We need some simpler distribution q(z), called proposal distribution, from which we can readily draw samples. The proposal distribution needs to satisfy kq(z) ≥ p(z) for all values of z and some constant k. The function kq(z) provides an upper envelope for p. Each step of the rejection sampler involves generating two random numbers:
sample z0 ∼ q(z), which corresponds to picking a random z location; and sample u0 ∼ Unif(0, kq(z0)), which corresponds to picking a random height (y location) under the envelope. If u0 > p(z0), the sample is rejected; otherwise z0 is retained.
Here p(z) denotes the unnormalized version of the target distribution.
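The two-random-number scheme can be sketched for an example unnormalized target. The target p~(z) = exp(-z^4) and the standard-normal proposal are illustrative choices; the constant k was picked by hand so that kq(z) ≥ p~(z) everywhere (the ratio p~/q peaks near |z| = 1/2 at about 2.67).

```python
import math
import random

def p_tilde(z):
    # Unnormalized target distribution.
    return math.exp(-z ** 4)

def q(z):
    # Standard-normal proposal density.
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

K = 2.7  # chosen so that K * q(z) >= p_tilde(z) for all z

def rejection_sample():
    while True:
        z0 = random.gauss(0.0, 1.0)           # random z location, drawn from q
        u0 = random.uniform(0.0, K * q(z0))   # random height under the envelope
        if u0 <= p_tilde(z0):                 # falls under the target curve
            return z0                         # accept z0

random.seed(4)
draws = [rejection_sample() for _ in range(20_000)]
# The target is symmetric about 0, so the sample mean is near 0.
```

Note that only p~(z) (unnormalized) is ever evaluated; the normalizing constant of the target is never needed.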
How efficient is rejection sampling?
The fraction of points that are rejected by this method depends on the ratio of the area under the unnormalized distribution to the area under the curve kq(z).
If p is a normalized target distribution, the acceptance probability is 1/k. The constant k should be as small as possible subject to the limitation that kq(z) must be nowhere less than p(z).
For rejection sampling to be of practical value, we require that the comparison function kq(z) be close to the target distribution so that the rate of rejection is kept to a minimum. For practical examples, where the target distribution may be multimodal and sharply peaked, it will be extremely difficult to find a good proposal distribution and comparison function.
Rejection can be a useful technique in one or two dimensions; it is unsuited to problems of high dimensionality.
The acceptance rate diminishes exponentially with dimensionality.
____________ sampling can be a useful technique in one or two dimensions. It is, however, unsuited to problems of high dimensionality. The ___________ rate diminishes exponentially with dimensionality.
Rejection; acceptance
What are the limitations of rejection sampling?
It is practical only in one or two dimensions: the acceptance rate diminishes exponentially with dimensionality, and for multimodal, sharply peaked targets it is very hard to find a proposal q(z) and constant k that keep the rejection rate low.
What’s the idea behind importance sampling?
Draw samples from a proposal distribution q(z) and, instead of discarding any of them, weight each sample by the importance weight w = p(z)/q(z). The weighted average of f over the samples then approximates the expectation of f under p.
What are the limitations of importance sampling?
It works well only when the proposal q(x) is similar to the target p(x). If p(x) is high-dimensional or strongly correlated, it is almost impossible to design a good proposal; this leads to high variance in the weights, meaning that a few samples dominate the estimate.
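Weighting instead of rejecting can be sketched as follows. The densities are illustrative choices, not from the source: target p = N(0, 1), proposal q = N(0, 2²), and f(z) = z², whose true expectation under p is 1.

```python
import math
import random

def normal_pdf(z, mu, sigma):
    return math.exp(-((z - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

random.seed(5)
n = 100_000
total = 0.0
for _ in range(n):
    z = random.gauss(0.0, 2.0)                              # sample from q
    w = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 0.0, 2.0)   # importance weight p(z)/q(z)
    total += w * z * z                                      # weighted contribution of f(z)
estimate = total / n
# No sample is discarded; each contributes in proportion to its weight.
```

With a proposal badly matched to the target, the same estimator remains unbiased but the weight variance explodes, which is the limitation described above.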
What is the idea behind Gibbs sampling?
For low-dimensional problems, we can use methods such as ____________ sampling. However, for high-dimensional problems, it is more common to use Markov chain Monte Carlo (MCMC). _________ sampling is a simple and widely applicable MCMC algorithm. It reduces the problem of ______________ sampling to the problem of univariate sampling.
importance; Gibbs; multivariate
In the context of Gibbs sampling, z^(k+1) differs from z^(k) in what?
In one single component!
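The one-component-at-a-time update can be sketched on an example target: a standard bivariate Gaussian with correlation ρ = 0.8 (an illustrative choice), whose exact univariate conditionals are zi | zj ∼ N(ρ zj, 1 − ρ²).

```python
import random

RHO = 0.8
COND_STD = (1.0 - RHO ** 2) ** 0.5   # std of z_i | z_j under the bivariate Gaussian

random.seed(6)
z1, z2 = 0.0, 0.0
samples = []
for _ in range(50_000):
    # Each resampling step changes exactly one component,
    # drawing it from its univariate conditional given the rest.
    z1 = random.gauss(RHO * z2, COND_STD)   # resample z1 | z2
    z2 = random.gauss(RHO * z1, COND_STD)   # resample z2 | z1
    samples.append((z1, z2))

burned = samples[1000:]   # discard early samples before the chain mixes
# The retained samples have empirical correlation close to 0.8.
```

This is the sense in which Gibbs sampling reduces multivariate sampling to a sequence of univariate sampling problems.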
Exact inference becomes computationally infeasible in large or dense networks. We replace exact computation with sampling-based approximations that trade precision for scalability. True or False?
True