Mixture Models - 04 Flashcards

(49 cards)

1
Q

In a latent variable model, we assume that the observed variables are caused by, or generated by, some _____________ ___________ factors, which represent the "true" state of the world. These models are harder to ____ than models with no latent variables.

A

underlying latent; fit

2
Q

What are the advantages of the latent variable models?

A

1) LVMs often have fewer parameters than models that directly represent correlation in the visible space.

2) The hidden variables in an LVM can serve as a bottleneck, which computes a compressed representation of the data → basis of unsupervised learning.

3
Q

An LVM is any probabilistic model in which some variables are always latent or hidden. Give an example of an LVM.

A

Mixture model

4
Q

What’s an LVM?

A

It is any probabilistic model in which some variables are always latent or hidden.

5
Q

Interpret an image in terms of an underlying 3D scene, represented by objects and surfaces. Forward mapping from hidden state to visible state is often ______________ (different latent values may give rise to the same observation). The inverse mapping is ___________.

A

many-to-one; ill-posed

6
Q

Mixing weights can also be called ___________ _____________.

A

mixture coefficients

7
Q

If a bottle belongs to juice type A or B, its sugar concentration is assumed to be generated from a __________ distribution specific to that _______.

A

Gaussian; type

8
Q

The most widely used mixture model is the mixture of ____________.

A

Gaussians

9
Q

By using a sufficient number of ___________ and by adjusting their means and ______________ as well as the coefficients in the __________ combination, a GMM can be used to approximate any __________ defined on R^D.

We can use ______________ ______________ to set the values of the parameters that define the GMM distribution.

A

Gaussians; covariances; linear; density; maximum likelihood
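Written out, the linear combination this card describes is the standard GMM density:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x}\mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad 0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1
```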

10
Q

In the context of mixture models, the likelihood function is given by a ____________ of the probabilities of each ____________ when given the set of ______________.

A

product; datapoint; parameters
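Concretely, taking the log of that product of per-datapoint probabilities gives the GMM log-likelihood:

```latex
\ln p(\mathbf{X}\mid \boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})
\;=\; \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```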

11
Q

The maximum likelihood solution for the parameters no longer has a ________-form ____________ solution due to the presence of the summation over k inside the ____________.

A

closed; analytical; logarithm

12
Q

To maximize likelihood, we can employ a powerful framework called ____________-_____________ (EM).

A

Expectation; Maximization

13
Q

The mean µ_{k} for the k-th Gaussian component is obtained by taking a ___________ _________ of all the points in the dataset, in which the ____________ factor for data point x_{n} is given by the posterior probability r_{nk} that component k is responsible for _____________ x_{n}.

A

weighted mean; weighting; generating
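In symbols, with r_{nk} the posterior responsibility of component k for point x_n:

```latex
r_{nk} \;=\; \frac{\pi_k \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
{\sum_{j=1}^{K} \pi_j \,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},
\qquad
\boldsymbol{\mu}_k \;=\; \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}\,\mathbf{x}_n,
\qquad
N_k \;=\; \sum_{n=1}^{N} r_{nk}
```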

14
Q

Similarly to the mean, the covariance Σ_{k} for the k-th Gaussian component is proportional to the weighted _____________ ___________ _____________, i.e., each data point is weighted by the corresponding posterior probability.

A

empirical scatter matrix
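With the responsibilities r_{nk} and effective counts N_k defined as in the previous card, the update is:

```latex
\boldsymbol{\Sigma}_k \;=\; \frac{1}{N_k} \sum_{n=1}^{N} r_{nk}
\,(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf{T}},
\qquad
N_k \;=\; \sum_{n=1}^{N} r_{nk}
```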

15
Q

The mixing coefficient for the k-th component is given by the average _______________ that component takes for explaining the _____________.

A

responsibility; datapoints

16
Q

The estimation of the mixing coefficients makes use of a ______________ multiplier.

A

Lagrange
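The multiplier λ enforces the constraint that the mixing coefficients sum to one. Maximizing the Lagrangian

```latex
\ln p(\mathbf{X}\mid \boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})
\;+\; \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)
```

with respect to π_k gives λ = −N and hence π_k = N_k / N, the average responsibility of component k.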

17
Q

The previous results for mean, covariances, and mixing coefficients don’t constitute a _________-form solution for the parameters of the mixture model.
The responsibilities (or posterior probabilities) r_{nk}, which give the conditional probability of ____ given ____, depend on those parameters in a complex way.

These results suggest a simple iterative scheme for finding a solution to the _____________ _____________ problem. It turns out to be an instance of the _______ algorithm for the particular case of the Gaussian mixture model.

A

closed; z; x; maximum likelihood; EM

18
Q

Explain the EM algorithm for GMMs.

A

1) Initialize the means, covariances and mixing coefficients, and evaluate the initial value of the log likelihood.

2) E-step: use the current values for the parameters to evaluate the posterior probabilities.

3) M-step: use these probabilities to re-estimate the means, covariances, and mixing coefficients^a.

4) Evaluate the log-likelihood (eq. 1) and check for convergence of either the parameters or the log-likelihood. If convergence is not satisfied, return to step 2^b.

^a We first evaluate the new means and then use these new values to find the covariances.

^b Each update to the parameters resulting from an E-step followed by an M-step is guaranteed to increase the log-likelihood function.
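The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function and variable names are my own, there is no convergence check (it just runs a fixed number of iterations), and the only numerical safeguard is a small ridge added to each covariance.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)) / norm

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1) Initialize means (random datapoints), covariances, mixing coefficients.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    cov = np.stack([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2) E-step: posterior responsibilities r_nk for each point/component.
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], cov[k])
                         for k in range(K)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # 3) M-step: weighted re-estimates (new means first, then covariances).
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return mu, cov, pi

# Usage: two well-separated clusters around -5 and +5.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
mu, cov, pi = em_gmm(X, K=2)
```

As the footnote says, the M-step computes the new means before using them in the covariance updates.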

19
Q

What is the goal of the EM algorithm for GMMs?

A

Given a GMM, the goal is to maximize the likelihood function with respect to the parameters.

20
Q

What’s a full covariance matrix?

A

It means the components may independently adopt any position and shape.

21
Q

What’s a tied covariance matrix?

A

It means all components share the same shape, but that shape may be anything.

22
Q

What’s a diagonal covariance matrix?

A

It means the contour axes are oriented along the coordinate axes, but otherwise, the eccentricities may vary between components.

23
Q

What’s a spherical covariance matrix?

A

It is a "diagonal" situation with circular contours (spherical in higher dimensions).
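The four covariance structures from the last cards can be illustrated by constructing one covariance family of each kind directly (the array names are my own):

```python
import numpy as np

D, K = 2, 3  # data dimension, number of mixture components
rng = np.random.default_rng(0)

# Full: each component has its own unconstrained SPD matrix
# (any position and shape, independently).
A = rng.normal(size=(K, D, D))
full = A @ A.transpose(0, 2, 1) + np.eye(D)

# Tied: one shared SPD matrix, so every component has the same shape.
B = rng.normal(size=(D, D))
tied = np.repeat((B @ B.T + np.eye(D))[None], K, axis=0)

# Diagonal: contour axes aligned with the coordinate axes;
# eccentricity may still vary between components.
diag = np.stack([np.diag(rng.uniform(0.5, 2.0, size=D)) for _ in range(K)])

# Spherical: the diagonal case with circular (isotropic) contours,
# a single variance per component.
spherical = np.stack([rng.uniform(0.5, 2.0) * np.eye(D) for _ in range(K)])
```

scikit-learn's `GaussianMixture` exposes exactly these four options through its `covariance_type` parameter (`'full'`, `'tied'`, `'diag'`, `'spherical'`).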

24
Q

K-means is a special case of EM. True or False?

A

True
25

Q

K-means is a special case of GMM using the EM algorithm, in which we make two approximations: 1) we fix Σ_{k} = I and π_{k} = 1/K for all the clusters (so we just have to estimate the ________); 2) we approximate the E-step by replacing the _______ _____________ with ________ __________ assignment. With this approximation, the weighted MLE problem for the means in the M-step reduces to a plain average.

A

means; soft responsibilities; hard cluster
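A sketch of this restricted EM, assuming identity covariances and equal mixing weights so the hard E-step reduces to nearest-centroid assignment (names are my own; no convergence check):

```python
import numpy as np

def kmeans_hard_em(X, K, n_iter=20, seed=0):
    """K-means as restricted EM: Sigma_k = I, pi_k = 1/K, hard assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Hard E-step: with identity covariances and equal weights, the most
        # responsible component is simply the nearest centroid.
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dist.argmin(axis=1)
        # M-step: the weighted MLE for each mean collapses to a plain average
        # (empty clusters keep their old centroid, a common safeguard).
        mu = np.stack([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return mu, z

# Usage: two clusters around -5 and +5.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-5.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
mu, z = kmeans_hard_em(X, K=2)
```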
26

Q

There is a significant problem associated with the maximum likelihood framework applied to Gaussian mixture models, due to the presence of _______________. The maximization of the log-likelihood function is not a _______-posed problem because singularities will always be present and will occur whenever one of the Gaussian ________________ "collapses" onto a specific _____________. Once we have (at least) two components in the mixture, one of the components can have a finite __________ and therefore assign a finite _______________ to all of the data points. In contrast, the other component can _________ onto a single specific data point and thereby contribute an ever-______________ ___________ value to the log likelihood. These singularities provide another example of the severe ________________ that can occur in a maximum likelihood approach.

A

singularities; well; components; datapoint; variance; probability; shrink; increasing additive; overfitting
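To see the collapse concretely: if component k sits exactly on one datapoint, µ_k = x_n, with spherical covariance σ_k² I, its density at that point is

```latex
\mathcal{N}\!\left(\mathbf{x}_n \mid \boldsymbol{\mu}_k = \mathbf{x}_n,\; \sigma_k^2 \mathbf{I}\right)
\;=\; \frac{1}{(2\pi)^{D/2}\,\sigma_k^{D}}
\;\longrightarrow\; \infty
\quad \text{as } \sigma_k \to 0,
```

so the log-likelihood can be driven to infinity by shrinking that component's variance.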
27

Q

Another issue with the maximum likelihood framework is related to _________________. If we have two Gaussian components, we could label them "Component A" and "Component B", but ____________ their labels doesn't __________ the model at all. So, the parameters are not ____________ _______________. The model describes the same probability distribution under multiple equivalent ______________________.

A

identifiability; swapping; change; uniquely identifiable; parametrizations
28

Q

What are the 2 main issues with the maximum likelihood framework for GMMs?

A

The presence of singularities and the problem of identifiability.
29

Q

The goal of EM is to find the maximum likelihood solutions for models having latent variables. True or False?

A

True
30

Q

When the logarithm cannot be pushed inside the ________, the log-likelihood function is _________ to optimize and results in _________________ expressions for the maximum likelihood solution.

A

sum; hard; complicated
31

Q

In practice, we are not given the ___________ dataset {X,Z}, but only the incomplete data _____. Our state of knowledge of the values of the latent variables in ____ is only given by the posterior distribution p(Z|X,θ). Because we cannot use the ____________-data log-likelihood, we consider instead its _____________ _________ under the posterior distribution of the __________ variable, E_{Z|X,θ} ln[p(X,Z|θ)]. This corresponds to the _____-step of the EM algorithm and in the _____-step, we ____________ this expectation.

A

complete; X; Z; complete; expected value; latent; E; M; maximize
32

Q

Given a joint distribution p(X,Z|θ) over observed variables X and latent variables Z, governed by parameters θ, the goal is to maximize the likelihood function p(X|θ) with respect to θ. 1) Choose an initial setting for the parameters θ^{old}. 2) E-step: (responsibilities) Evaluate p(Z|X,θ^{old}). 3) M-step: (weighted updates) Evaluate θ^{new} = argmax_θ Q(θ,θ^{old}). 4) Check for convergence of either the log-likelihood or the parameter values. If the convergence criterion is not satisfied, then return to step ____.

A

2
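The quantity Q maximized in the M-step is the expected complete-data log-likelihood under the posterior computed in the E-step:

```latex
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
\;=\; \sum_{\mathbf{Z}} p(\mathbf{Z}\mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})
\,\ln p(\mathbf{X}, \mathbf{Z}\mid \boldsymbol{\theta}),
\qquad
\boldsymbol{\theta}^{\text{new}} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})
```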
33

Q

What are the main applications for mixture models?

A

1) Use mixture models as a black-box density model, p(x): useful for a variety of tasks, such as data compression, outlier detection, and creating generative classifiers, where we model each class-conditional density p(x|y = c) by a mixture distribution.

2) Use mixture models for classification.

3) Use mixture models for clustering (more common application).
34

Q

Using a generative model to perform _________________ can be useful when we have ____________ data.

A

classification; missing
35

Q

When using LVMs, we must specify the number of __________ ___________, which controls the model _____________. In the case of mixture models, we must specify K, the number of ______________. The optimal Bayesian approach is to pick the model with the ____________ marginal likelihood.

A

latent variables; complexity; components; largest
36

Q

What are the two main problems that model selection brings up?

A

1) Evaluating the marginal likelihood for LVMs is quite difficult. In practice, simple approximations, such as the Bayesian Information Criterion (BIC), can be used. We can also use the cross-validated likelihood as a performance measure (this can be slow, since it requires fitting each model F times, where F is the number of CV folds).

2) The need to search over a potentially large number of models. The usual approach is to perform an exhaustive search over all candidate values of K. An alternative approach is to perform stochastic sampling in the space of models.
37

Q

The Akaike Information Criterion (AIC) penalizes complex models ________ heavily than BIC, since the regularization term is ________________ of N.

A

less; independent
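Using the common "lower is better" convention (AIC = 2k − 2 ln L̂, BIC = k ln N − 2 ln L̂, where k is the number of parameters and L̂ the maximized likelihood), the contrast in the penalty terms is easy to see:

```python
import numpy as np

def aic(log_lik, k):
    """AIC = 2k - 2 ln L: the penalty 2k does not depend on N."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, N):
    """BIC = k ln N - 2 ln L: the penalty grows with the dataset size."""
    return k * np.log(N) - 2 * log_lik

# Same fit quality and parameter count, growing N: only BIC's penalty grows,
# so BIC favours simpler models more strongly on large datasets.
small = bic(-100.0, 5, 100)
large = bic(-100.0, 5, 10_000)
```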
38

Q

Each cluster is associated with a Gaussian distribution that fills a volume of the input space, rather than being a degenerate spike. Once we have enough clusters to cover the true modes of the distribution, the Bayesian Occam's razor kicks in and starts penalizing the model for being unnecessarily complex. True or False?

A

True
39

Q

BIC and AIC are two methods used for model selection. What is another strategy?

A

Incrementally growing the number of mixture components!

1) Start with a small value of K, and after each round of training, consider splitting the component with the highest mixing weight into two. The new centroids are random perturbations of the original centroid, and the new scores are half of the old scores.

2) If a new component has too small a score, or too narrow a variance, it is removed.

3) Continue this until the desired number of components is reached.
40

Q

When considering ____________ problems, it is common to assume a ___________ output distribution, such as a Gaussian distribution, where the mean and variance are some function of the input. However, this will not work well for one-to-many functions, in which each _______ can have multiple possible __________. Any model that is trained to maximize likelihood using a ___________ ___________ ____________ (even if the model is a flexible nonlinear model, such as a neural network) will work poorly on _____-____-________ functions. To prevent this problem of regression to the mean, we can use a _____________ mixture model.

A

regression; unimodal; input; outputs; unimodal output density; one; to; many; conditional
41

Q

In a Mixture of Experts (MoE) model, we assume the output is a _________ __________ of K different outputs, corresponding to different __________ of the output distribution for each __________.

A

weighted mixture; modes; input
42

Q

What is a gating function?

A

It decides which expert (p(y|x,z = k)) to use, depending on the input values: p(z = k|x).
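A toy sketch of a gate combining K experts. The linear-softmax form of the gate, the expert functions, and all names and values here are illustrative assumptions, not from the cards:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def gate(x, W):
    """Gating function p(z = k | x); the linear-softmax form is an assumption."""
    return softmax(W @ x)

def moe_predict(x, W, experts):
    """MoE mean prediction: sum_k p(z = k | x) * expert_k(x)."""
    weights = gate(x, W)
    outputs = np.array([f(x) for f in experts])
    return weights @ outputs

# Toy usage: two scalar "experts" and a 2-way gate (all values illustrative).
experts = [lambda x: x.sum(), lambda x: -x.sum()]
W = np.array([[1.0, 1.0], [-1.0, -1.0]])
x = np.array([2.0, 0.0])
y = moe_predict(x, W, experts)  # the gate strongly prefers expert 0 here
```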
43

Q

A MoE can be trained using _______ or using the EM algorithm.

A

SGD
44

Q

The gating function and experts can be any kind of conditional ______________ model, not just a linear model. If we make them both DNNs, then the resulting model is called a mixture density network or a ________ ___________ of ___________.

A

probabilistic; deep mixture; experts
45

Q

What are the strengths (4) and limitations (3) of the GMMs?

A

Strengths:

1) Flexible density estimation (can approximate any continuous distribution)

2) Probabilistic soft clustering and uncertainty

3) EM algorithm guarantees non-decreasing likelihood

4) Clear links to k-means and generative classifiers

Limitations:

1) Assumes Gaussian-like cluster shapes

2) Choosing the number of components K is hard

3) Susceptible to singularities and identifiability issues
46

Q

What are the strengths (4) and limitations (4) of the MoEs?

A

Strengths:

1) Handles multimodal outputs (avoids regression to the mean)

2) Input-dependent gating allows specialization

3) Extends naturally to deep learning (Mixture Density Nets, Switch Transformer, GLaM)

4) Sparse activation enables trillion-param LMs at fixed compute

Limitations:

1) Training complexity (dead experts, imbalance)

2) Experts may be less interpretable

3) Still requires a choice of the number of experts

4) Risk of overfitting if the gating is too flexible
47

Q

What's a Switch Transformer?

A

It is a MoE that scales up to 1.6 trillion parameters and where each token activates only 1 expert.
48

Q

What's a GLaM?

A

It's a trillion-scale language model (a MoE), with 64 experts per layer and where 2 experts are activated per token.
49

Q

What are the disadvantages of k-means?

A

1) Lack of flexibility in cluster shape

2) Lack of probabilistic cluster assignment