Learning PGM - 10 Flashcards

(43 cards)

1
Q

What are the 2 approaches to the task of acquiring a model?

A

1) Construct a network by hand, typically with the help of an expert.
- However, knowledge acquisition from experts is a nontrivial task.
- Larger networks can require weeks or even months of work.

2) Construct a model from a set of instances, called model learning.
- The data instances are sampled independently from an underlying distribution p; such data instances are called independent and identically distributed (iid).
- Our ideal solution is to return a model that precisely captures the distribution p from which our data were sampled.
- This is not generally achievable, for computational reasons and because a limited dataset provides only a rough approximation of the true underlying distribution.

2
Q

The learning task should be viewed as an ______________ problem; we have:

  • A _______________ space: the set of candidate models
  • An _______________ function: a criterion for quantifying our preference for different models
A

optimization; hypothesis; objective

3
Q

Define parameter learning.

A
  • Parameter learning: given a fixed network structure, estimate the parameters of the local CPDs from data.
  • For the case of a fixed structure and complete data, the optimization problem is convex and can be solved optimally using simple iterated numerical optimization algorithms.
  • Unfortunately, each step of the optimization algorithm requires inference over the network, which can be expensive for large models.
4
Q

Define structure learning.

A

For the case of complete data, it is also formulated as an optimization problem, where different network structures are given a score, and we aim to find the network whose score is highest.

5
Q

When focusing on parameter learning and trying to compute the posterior probability of the parameters given the data, we are assuming that the _____________ is fixed.

We can then compute the parameter posterior by treating the set of parameters as a ___________ variable and then by performing _______________.

If we have a uniform prior, MAP estimation reduces to __________________ _______________ ________________ (MLE).

A

structure; hidden; inference; maximum; likelihood; estimation

6
Q

When can we say we are learning from complete data?

A

If all the variables are fully observed in each case, so there is no missing data and there are no hidden variables, we say the data is complete.

7
Q

For a directed graphical model with complete data, the likelihood decomposes
according to the ________ ____________.

The parameter prior factorizes as well for each _________.

The posterior factorizes in a way where we can compute the posterior of each ________ independently.

A

graph structure; node; CPD
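A minimal sketch of this decomposition for a toy network A -> B with complete binary data (the dataset below is an illustrative assumption): the total log-likelihood equals the sum of the per-node terms, so each CPD can be estimated on its own.

```python
# Toy BN A -> B with complete binary data: the log-likelihood splits into
# one term per node (per CPD). Dataset is illustrative, not from the source.
import math
from collections import Counter

data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]  # samples of (A, B)
n = len(data)

# MLE for P(A): normalized counts of A
cA = Counter(a for a, _ in data)
pA = {a: cA[a] / n for a in cA}

# MLE for P(B | A): normalized counts of B within each value of A
cAB = Counter(data)
pB_given_A = {(a, b): cAB[(a, b)] / cA[a] for (a, b) in cAB}

# The joint log-likelihood equals the sum of the two per-node terms
ll_A = sum(math.log(pA[a]) for a, _ in data)
ll_B = sum(math.log(pB_given_A[(a, b)]) for a, b in data)
ll_joint = sum(math.log(pA[a] * pB_given_A[(a, b)]) for a, b in data)

assert abs((ll_A + ll_B) - ll_joint) < 1e-9
```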

8
Q

When computing the MLE for CPTs, the likelihood is given by a product of multinomials. True or False?

A

True

9
Q

The Maximum Likelihood Estimation (MLE) for a _______________ is given by the normalized empirical _________________.

A

multinomial; frequencies
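As a quick illustration (the counts are made up), the MLE of a multinomial is each outcome's count divided by the total:

```python
# MLE for a multinomial = normalized empirical frequencies.
# The counts are illustrative.
counts = {"a": 3, "b": 1, "c": 6}
total = sum(counts.values())
mle = {k: v / total for k, v in counts.items()}

assert mle["c"] == 0.6                       # 6 / 10
assert abs(sum(mle.values()) - 1.0) < 1e-12  # a valid distribution
```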

10
Q

There are many zeros in the ___________ statistics due to the ________ sample size, leading to extreme estimates for some of the probabilities.

A

sufficient; small

11
Q

Using MLE for the CPTs in a discrete Bayesian network can suffer from the zero-count problem.

What’s a correcting approach to solve this issue?

A

A Bayesian approach solves this problem.

Add a separate Dirichlet prior on each row of each CPT. The posterior is then computed by simply adding the pseudo-counts to the empirical counts.
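A small sketch of the pseudo-count correction for one CPT row, with illustrative counts and a uniform Dirichlet prior:

```python
# Zero-count problem and its Bayesian fix: with a Dirichlet(alpha) prior on a
# CPT row, the posterior just adds pseudo-counts to the empirical counts.
# The numbers are illustrative.
counts = [0, 2, 8]   # empirical counts for one CPT row; note the zero
alpha = [1, 1, 1]    # Dirichlet pseudo-counts (uniform prior)

mle = [c / sum(counts) for c in counts]         # extreme zero estimate
post = [c + a for c, a in zip(counts, alpha)]   # posterior Dirichlet counts
post_mean = [p / sum(post) for p in post]       # smoothed estimate

assert mle[0] == 0.0       # MLE assigns probability exactly zero
assert post_mean[0] > 0.0  # the Bayesian estimate avoids the extreme value
```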

12
Q

If your prior is Dirichlet, and your likelihood is multinomial, the posterior is also Dirichlet. True or False?

A

True

13
Q

When we have incomplete data, the likelihood (and posterior) factorizes over CPDs, so we can estimate each CPD independently. True or False?

A

False

14
Q

When we have incomplete or missing data, we can no longer compute the MLE or the posterior by solving separate problems per node. Instead, we must optimize the log-likelihood directly using numerical optimization methods. True or False?

A

True

15
Q

A popular method for estimating the parameters of a _________ in the presence of missing data is to use the _______________ _______________ algorithm.

The basic idea is to alternate between
inferring the __________ __________ (the E-step) and estimating the _____________ given this ______________ dataset (the M-step).

Rather than returning the full posterior in the E-step, we instead return the __________ _____________ statistics (ESS), which takes much less ________.

In the M-step, we _____________ the _____________ value of the log-likelihood of the fully observed data using these ESS.

A

DPGM; expectation maximization; latent variables; parameters; completed; expected sufficient; space; maximize; expected
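A compact EM sketch on a classic toy problem: a mixture of two biased coins, where the identity of the coin tossed in each batch is hidden. The data and initial biases are illustrative assumptions, not from the source; the E-step accumulates expected sufficient statistics (ESS) and the M-step re-estimates the biases from them.

```python
# EM for a two-coin mixture: hidden variable = which coin produced each batch.
# Data and initial parameters are illustrative.
from math import comb

flips = 10
heads = [5, 9, 8, 4, 7]   # heads observed in each batch of 10 flips
theta = [0.6, 0.5]        # initial biases for coins 0 and 1

def binom(h, p):
    """Binomial likelihood of h heads in `flips` tosses with bias p."""
    return comb(flips, h) * p ** h * (1 - p) ** (flips - h)

for _ in range(50):
    # E-step: posterior responsibility of each coin for each batch,
    # accumulated into expected head/tail counts (the ESS)
    ess_h, ess_t = [0.0, 0.0], [0.0, 0.0]
    for h in heads:
        like = [binom(h, theta[0]), binom(h, theta[1])]
        z = sum(like)
        for k in range(2):
            r = like[k] / z
            ess_h[k] += r * h
            ess_t[k] += r * (flips - h)
    # M-step: maximize the expected complete-data log-likelihood via the ESS
    theta = [ess_h[k] / (ess_h[k] + ess_t[k]) for k in range(2)]
```

Note that the ESS, not the full posterior over the hidden assignments, is all the M-step needs, which is why the E-step returns only these expected counts.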

16
Q

ESS depends on the parameters of the current iteration. True or False?

A

False. It depends on the parameters of the previous iteration.

17
Q

We can modify the M-step of the EM algorithm to perform _______ estimation with a Dirichlet prior by simply adding pseudo-counts to the expected count.

The Baum-Welch algorithm is a special case of this, which arises when the DPGM is a __________.

A

MAP; HMM

18
Q

When talking about undirected graph models, computing MLE can be computationally expensive, even in the fully _____________ case, because of the need to deal with the _____________ function.

Computing the posterior over the _______________ is even harder, because of the additional normalizing constant p(D).

A

observed; partition; parameters

19
Q

When restricting to the case of MRFs with log-linear potential functions, the average log-likelihood of the full dataset contains the ___________ function, which depends on the parameters, so there is no closed-form solution.

Unlike BNs, we cannot estimate the conditional probabilities independently of each other; we need to use _____________ methods to perform _____________ estimation.

A

partition; gradient; parameter
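A minimal sketch of gradient-based MLE for a tiny log-linear MRF, small enough that the partition function can be computed by brute-force enumeration. The gradient of the average log-likelihood is the empirical feature expectation minus the model feature expectation; features, data, and step size are illustrative assumptions.

```python
# Gradient ascent for p(x) proportional to exp(sum_k w_k f_k(x)) over two
# binary variables. The partition function Z couples all the weights, so
# there is no per-CPD closed form; we follow the gradient instead.
# Features, data, and learning rate are illustrative.
from math import exp
from itertools import product

feats = [lambda x: float(x[0]), lambda x: float(x[0] == x[1])]
data = [(1, 1), (1, 1), (0, 0), (1, 0)]
w = [0.0, 0.0]

# Empirical feature expectations (fixed)
emp = [sum(f(x) for x in data) / len(data) for f in feats]

for _ in range(500):
    # Unnormalized scores for every joint configuration
    scores = {x: exp(sum(wk * f(x) for wk, f in zip(w, feats)))
              for x in product([0, 1], repeat=2)}
    Z = sum(scores.values())                       # partition function
    # Model feature expectations under the current parameters
    model = [sum(scores[x] / Z * f(x) for x in scores) for f in feats]
    # Gradient step: empirical minus model expectations
    w = [wk + 0.5 * (e - m) for wk, e, m in zip(w, emp, model)]
```

At convergence the model's feature expectations match the empirical ones, which is the moment-matching condition for maximum likelihood in log-linear models.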

20
Q

What are the 2 main approaches for structure learning?

A

1) Constraint-based approach: uses independence tests to identify a set of edge constraints for the graph and then finds the best DAG that satisfies those constraints. It works well when combined with prior (expert) knowledge of the structure, but it requires many data samples to guarantee testing power, so it is less reliable when the number of samples is small. For example, we can distinguish a v-structure from a fork structure by testing whether the two outer variables are independent conditional on the middle variable.

2) Score-based approach: defines a criterion to evaluate how well the Bayesian network fits the data, then searches over the space of DAGs for a structure achieving the maximal score. It is essentially a search problem that consists of two parts: the definition of a score function and the search algorithm.

21
Q

Score-based methods view a Bayesian network as specifying a ____________ model and then address learning as a model ______________ problem.

A

statistical; selection

22
Q

Score-based methods approach the problem of structure learning as an optimization problem. True or False?

A

True

23
Q

The ______ score is an asymptotic approximation of the Bayesian score.

A

BIC

24
Q

In structure learning, the likelihood function, which we used for parameter
estimation, measures the probability of the _______ given a _________.

It seems intuitive to find a model that would make the ________ as probable as possible. Our goal is to find a graph and parameters that maximize the likelihood.

To find the maximum likelihood pair (structure, parameters), we should find the graph structure G that achieves the highest likelihood when we use the ______ parameters for G.

A

data; model; data; MLE

25
Q

What are the limitations (3) of the maximum likelihood score?

A

The likelihood score is a good measure of the fit between the estimated Bayesian network and the training data. In learning structure, however, we are also concerned about the performance of the learned network on new instances sampled from the same underlying distribution p.

1) The maximum likelihood network will exhibit conditional independence only when that independence happens to hold exactly in the empirical distribution.

2) Due to statistical noise, exact independence almost never occurs, and therefore, in almost all cases, the maximum likelihood network will be a fully connected one. The likelihood overfits the training data, learning a model that precisely fits the specifics of the empirical distribution in our training set. This model fails to generalize well to new data cases: these are sampled from the underlying distribution, which is not identical to the empirical distribution in our training set.

3) The maximum likelihood score never prefers the simpler network over the more complex one.
26
Q

The main principle of the Bayesian approach for structure learning is that whenever we have uncertainty over anything, we should place a distribution over it. True or False?

A

True
27
Q

In the context of the Bayesian score, we have uncertainty both over the structure and over the parameters. We define a ____________ prior that puts a prior probability on different graph structures, and a ____________ prior that puts a probability on different choices of parameters once the ________ is given.

A

structure; parameter; graph
28
Q

The Bayesian score takes into consideration the uncertainty over _____________ given a graph G.

The prior over network structures p(G) describes our ______ for a certain structure, but it plays a relatively ________ role. From the approximation to the Bayesian score, the logarithm of the marginal likelihood grows linearly with the _________ of samples, while the prior over ___________ remains constant. Thus, the structure prior does not play an important role in asymptotic analysis as long as it does not rule out any structure. We often use a ___________ prior over structures.

A

parameters; bias; minor; number; structures; uniform
29
Q

How do we represent parameter priors in the context of the Bayesian score?

A

1) K2 prior: takes some fixed Dirichlet distribution for every parameter, where alpha is a predetermined constant.

2) BDeu prior: elicits a prior distribution over the entire probability space and an equivalent sample size for the set of imaginary samples. Typically, the prior distribution is assumed to be uniform over all possible joint configurations.
30
Q

The ______ prior is simple to represent and efficient to use, but it is somewhat inconsistent. The imaginary samples we have seen for different events are a basic concept that should not _______ with different candidate structures.

A

K2; vary
31
Q

The BDeu prior avoids the inconsistencies of the K2 prior. True or False?

A

True
32
Q

The BIC score is an approximation of the _____________ score when N tends to infinity.

A

Bayesian
33
Q

The second term in the BIC scoring function serves as a regularization term, favoring ____________ models. The influence of model complexity _____________ as N grows, allowing the log-likelihood term to eventually dominate the score.

A

simpler; decreases
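A small illustration of this trade-off, using BIC(G) = log-likelihood - (log N / 2) * dim(G), where dim(G) is the number of free parameters. The per-sample log-likelihoods and parameter counts below are made-up numbers:

```python
# BIC = fit - complexity penalty. The penalty grows like log N while the
# log-likelihood grows like N, so fit eventually dominates.
# All numeric values here are illustrative.
from math import log

def bic(loglik, n_params, n_samples):
    return loglik - 0.5 * log(n_samples) * n_params

# Simpler model: 3 parameters, slightly worse fit per sample (-1.02)
# Complex model: 20 parameters, slightly better fit per sample (-1.00)
n = 100
small_simple = bic(-1.02 * n, 3, n)
small_complex = bic(-1.00 * n, 20, n)
assert small_simple > small_complex    # at small N the penalty wins

n = 100_000
big_simple = bic(-1.02 * n, 3, n)
big_complex = bic(-1.00 * n, 20, n)
assert big_simple < big_complex        # at large N the fit advantage wins
```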
34
Q

The Bayesian score seems to be biased toward simpler structures, but as it gets more data, it is willing to recognize that a more complex structure is necessary. In other words, it appears to trade off fit to the data against model complexity, thereby reducing the extent of overfitting. True or False?

A

True
35
Q

What is the input for structure search?

A

Training set D + scoring function (including priors, if needed) + set of possible network structures (incorporating any prior knowledge).
36
Q

What is the output of structure search?

A

A network structure (from the set of possible structures) that maximizes the score.
37
Q

The Chow-Liu algorithm is a specific type of score-based approach that finds the maximum-likelihood ______-structured graph (i.e., each node has exactly ______ parent, except for the parentless root node). The score is simply the log-likelihood; there is no __________ term for graph structure ____________, since the algorithm only considers ________ structures.

A

tree; one; penalty; complexity; tree
38
Q

What is the first step of the Chow-Liu algorithm?

A

Compute the mutual information MI(X, U) for all pairs of variables X, U, and form a complete graph over the variables where the edge between X and U has weight MI(X, U). This quantity measures how much information U provides about X.
39
Q

What is the second step of the Chow-Liu algorithm?

A

Find the maximum-weight spanning tree: the maximal-weight tree that connects all vertices in the graph. This can be found using Kruskal's or Prim's algorithm.
40
Q

What is the third step of the Chow-Liu algorithm?

A

Pick any node to be the root variable and assign directions radiating outward from this node (arrows go away from it). This step transforms the resulting undirected tree into a directed one. The orientation of edges does not matter because mutual information is symmetric.
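The three steps above can be sketched end to end on a toy binary dataset (the data are illustrative; the spanning tree is grown Prim-style):

```python
# Chow-Liu on toy binary data: pairwise MI as edge weights, maximum-weight
# spanning tree, then orientation outward from an arbitrary root.
# The dataset is illustrative.
from math import log
from collections import Counter
from itertools import combinations

data = [(0, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 1, 0)]
n_vars, n = len(data[0]), len(data)

def mi(i, j):
    """Empirical mutual information between variables i and j."""
    cij = Counter((row[i], row[j]) for row in data)
    ci = Counter(row[i] for row in data)
    cj = Counter(row[j] for row in data)
    return sum((c / n) * log((c / n) / ((ci[a] / n) * (cj[b] / n)))
               for (a, b), c in cij.items())

# Step 1: MI weight for every pair (a complete graph on the variables)
weights = {(i, j): mi(i, j) for i, j in combinations(range(n_vars), 2)}

# Step 2: maximum-weight spanning tree (Prim-style greedy growth)
in_tree, edges = {0}, []
while len(in_tree) < n_vars:
    i, j = max(((i, j) for i, j in weights
                if (i in in_tree) != (j in in_tree)),
               key=lambda e: weights[e])
    edges.append((i, j))
    in_tree |= {i, j}

# Step 3: orient edges away from root 0 (parent -> child)
parents, frontier = {0: None}, [0]
while frontier:
    u = frontier.pop()
    for i, j in edges:
        for a, b in ((i, j), (j, i)):
            if a == u and b not in parents:
                parents[b] = u
                frontier.append(b)
```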
41
Q

The main challenge in computing the posterior over DAGs is that there are so many possible graphs, so an exhaustive search is ______________.

A

prohibitive
42
Q

Finding the ___________ optimal MAP DAG is provably NP-complete and intractable for graphs with more than about _____ nodes. So, a _____________ search must be performed instead, which finds a __________ optimal MAP DAG. The most common method is __________ ________ climbing, which performs a _________ search.

A

globally; 16; heuristic; locally; greedy hill; local
43
Q

Order the steps of the Greedy Hill Climbing algorithm.

A) Continue this process until no modification improves the score.
B) Consider all of the neighbors of G in the space (all of the legal networks obtained by applying a single operator to G, such as adding, deleting, or reversing a single edge) and compute the score for each of them.
C) Pick an initial network structure G as a starting point. This network can be the empty one, a random choice, the best tree, or a network obtained from some prior knowledge.
D) Apply the change that leads to the best improvement in the score.
E) Compute its score.

A

C - E - B - D - A
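The ordered steps can be sketched as a runnable loop over 3-node DAGs. The score function below is an illustrative stand-in for a real decomposable score such as BIC (in practice it would be computed from data); the TARGET edge set exists only so the toy score has something to reward.

```python
# Greedy hill climbing over DAG structures (steps C - E - B - D - A).
# The score is a toy stand-in, not a real data-driven score.
from itertools import permutations

NODES = [0, 1, 2]
TARGET = {(0, 1), (1, 2)}   # hypothetical "true" edges, used only by the toy score

def score(edges):
    # stand-in: reward TARGET edges, penalize extra edges (mimics a penalty term)
    return 2 * len(edges & TARGET) - len(edges - TARGET)

def acyclic(edges):
    """Kahn-style check that the directed edge set has no cycle."""
    parents = {v: {i for i, j in edges if j == v} for v in NODES}
    seen = set()
    while len(seen) < len(NODES):
        ready = [v for v in NODES if v not in seen and parents[v] <= seen]
        if not ready:
            return False
        seen |= set(ready)
    return True

def neighbors(edges):
    """All structures reachable by adding, deleting, or reversing one edge."""
    for i, j in permutations(NODES, 2):
        if (i, j) in edges:
            yield edges - {(i, j)}                # delete edge
            yield (edges - {(i, j)}) | {(j, i)}   # reverse edge
        elif (j, i) not in edges:
            yield edges | {(i, j)}                # add edge

g = frozenset()                  # step C: initial structure (here, the empty graph)
best = score(g)                  # step E: compute its score
while True:
    cands = [h for h in neighbors(g) if acyclic(h)]   # step B: legal neighbors
    if not cands:
        break
    h = max(cands, key=score)
    if score(h) <= best:         # step A: stop when no modification improves
        break
    g, best = h, score(h)        # step D: apply the best single change
```

With this toy score, the loop grows the empty graph edge by edge and halts once adding, deleting, or reversing any single edge no longer improves the score.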