What are the 2 approaches to the task of acquiring a model?
1) Construct a network by hand, typically with the help of an expert.
- However, knowledge acquisition from experts is a nontrivial task.
- Larger networks can require weeks or even months of work.
2) Construct a model from a set of instances, called model learning.
- The data instances are sampled independently from an underlying distribution p; such instances are called independent and identically distributed (iid).
- Our ideal solution is to return a model that precisely captures the distribution p from which our data were sampled.
- This is not generally achievable, because of computational reasons and because a limited dataset provides only a rough approximation of the true underlying distribution.
The learning task should be viewed as an ______________ problem: we have a ______________ space and an ______________ function.
optimization; hypothesis; objective
Define parameter learning.
Estimating the parameters (e.g., the CPT entries) of a model whose graph structure is fixed.
Define structure learning.
Learning the graph structure itself from the data. For the case of complete data, structure learning is also formulated as an optimization problem, where different network structures are given a score, and we aim to find the network whose score is highest.
When focusing on parameter learning and trying to compute the posterior probability of the parameters given the data, we are assuming that the _____________ is fixed.
We can then compute the parameter posterior by treating the set of parameters as a ___________ variable and then by performing _______________.
If we have a uniform prior, MAP estimation reduces to __________________ ________________ (MLE).
structure; hidden; inference; maximum likelihood; estimation
When can we say we are learning from complete data?
If all the variables are fully observed in each case, so there is no missing data and there are no hidden variables, we say the data is complete.
For a directed graphical model with complete data, the likelihood decomposes according to the ________ ____________.
The parameter prior factorizes as well for each _________.
The posterior factorizes in a way where we can compute the posterior of each ________ independently.
graph structure; node; CPD
When computing the MLE for CPTs, the likelihood is given by a product of multinomials (one per CPT row). True or False?
True
The Maximum Likelihood Estimation (MLE) for a _______________ is given by the normalized empirical _________________.
multinomial; frequencies
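As a concrete sketch of the two cards above, here is a toy two-node network A → B with made-up complete data: the likelihood decomposes per node, and the MLE for each CPT row is just the normalized empirical counts.

```python
# Toy example (assumed network and data): MLE for the CPTs of A -> B
# from complete data. Each CPT row is the normalized empirical frequencies.
from collections import Counter

# complete data: list of (a, b) observations
data = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0)]

# P(A): normalized counts of A
a_counts = Counter(a for a, _ in data)
n = len(data)
p_a = {a: c / n for a, c in a_counts.items()}

# P(B | A): for each value of the parent A, normalize the counts of B
ab_counts = Counter(data)
p_b_given_a = {(a, b): ab_counts[(a, b)] / a_counts[a] for (a, b) in ab_counts}

print(p_a)          # {0: 0.5, 1: 0.5}
print(p_b_given_a)  # e.g. P(B=1 | A=0) = 2/3
```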
There are many zeros in the ___________ statistics due to the ________ sample size, leading to extreme estimates for some of the probabilities.
sufficient; small
Using MLE for the CPTs in a discrete Bayesian network can suffer from the zero-count problem.
What’s a correcting approach to solve this issue?
A Bayesian approach can solve this problem: place a separate Dirichlet prior on each row of each CPT. We can then compute the posterior by simply adding the pseudo-counts to the empirical counts.
If your prior is Dirichlet, and your likelihood is multinomial, the posterior is also Dirichlet. True or False?
True
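A minimal sketch of this conjugacy for a single CPT row (the counts and prior below are made up): with a Dirichlet prior and a multinomial likelihood, the posterior is Dirichlet with the pseudo-counts added, so zero counts no longer yield zero probabilities.

```python
# One CPT row with a zero-count problem, smoothed by a Dirichlet prior.
# Posterior is Dirichlet(alpha + counts); its mean just adds pseudo-counts.
counts = [0, 7, 3]   # empirical counts; note the zero count
alpha = [1, 1, 1]    # Dirichlet pseudo-counts (uniform prior, assumed here)

total = sum(c + a for c, a in zip(counts, alpha))
posterior_mean = [(c + a) / total for c, a in zip(counts, alpha)]

print(posterior_mean)  # [1/13, 8/13, 4/13]: the zero count is smoothed away
```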
When we have incomplete data, the likelihood (and posterior) factorizes over CPDs, so we can estimate each CPD independently. True or False?
False
When we have incomplete or missing data, we can no longer compute the MLE or the posterior by solving separate problems per node. Instead, we must optimize the log-likelihood directly with numerical optimization methods. True or False?
True
A popular method for estimating the parameters of a _________ in the presence of missing data is to use the _______________ _______________ algorithm.
The basic idea is to alternate between inferring the __________ __________ (the E-step) and estimating the _____________ given this ______________ dataset (the M-step).
Rather than returning the full posterior in the E-step, we instead return the __________ _____________ statistics (ESS), which takes much less ________.
In the M-step, we _____________ the _____________ value of the log-likelihood of the fully observed data using these ESS.
DPGM; expectation maximization; latent variables; parameters; completed; expected sufficient; space; maximize; expected
ESS depends on the parameters of the current iteration. True or False?
False. It depends on the parameters of the previous iteration.
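As an illustration, here is a minimal EM sketch for a toy model with a hidden variable: a mixture of two biased coins, where which coin generated each sequence is unobserved. The data and initialization are made up. The E-step accumulates the expected sufficient statistics (ESS) using the previous iteration's parameters; the M-step re-estimates the parameters from those ESS.

```python
# Minimal EM for a mixture of two biased coins (hidden: which coin was used).
# Toy data: each entry is the number of heads out of n = 10 flips.
flips, n = [9, 8, 1, 2, 8], 10
theta = [0.6, 0.4]  # initial head probabilities for the two coins (assumed)

for _ in range(20):
    heads = [0.0, 0.0]   # expected head counts (ESS)
    totals = [0.0, 0.0]  # expected total flip counts (ESS)
    for h in flips:
        # E-step: posterior responsibility of each coin, using the
        # parameters from the previous iteration
        lik = [t**h * (1 - t) ** (n - h) for t in theta]
        z = sum(lik)
        for k in range(2):
            r = lik[k] / z
            heads[k] += r * h
            totals[k] += r * n
    # M-step: MLE of the parameters from the expected sufficient statistics
    theta = [heads[k] / totals[k] for k in range(2)]

print(theta)  # the coins separate toward roughly 0.83 and 0.15
```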
We can modify the M-step of the EM algorithm to perform _______ estimation with a Dirichlet prior by simply adding pseudo-counts to the expected count.
The Baum-Welch algorithm is a special case of this, which arises when the DPGM is an __________.
MAP; HMM
When talking about undirected graph models, computing MLE can be computationally expensive, even in the fully _____________ case, because of the need to deal with the _____________ function.
Computing the posterior over the _______________ is even harder, because of the additional normalizing constant p(D).
observed; partition; parameters
When restricting to the case of MRFs with log-linear potential functions, the average log-likelihood of the full dataset contains the log ___________ function, which depends on the parameters and has no closed form.
Unlike in BNs, we cannot estimate the conditional probabilities independently of each other; we need to use _____________ methods to perform _____________ estimation.
partition; gradient; parameter
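A sketch of why gradient methods are needed (the model, features, and data below are all made up): for a log-linear MRF, the gradient of the average log-likelihood is E_data[f(x)] - E_model[f(x)], and the model expectation requires summing over all states via the partition function Z, so there is no closed-form estimate.

```python
# Gradient ascent for a tiny log-linear MRF over two binary variables.
# Small enough that Z can be computed by brute-force enumeration.
import itertools
import math

def features(x1, x2):
    # hypothetical features: bias on x1, bias on x2, agreement indicator
    return [x1, x2, 1.0 if x1 == x2 else 0.0]

data = [(1, 1), (1, 1), (0, 0), (1, 0), (0, 1)]     # toy dataset
states = list(itertools.product([0, 1], repeat=2))  # full state space
w = [0.0, 0.0, 0.0]

# empirical feature expectations E_data[f]
emp = [sum(features(*x)[j] for x in data) / len(data) for j in range(3)]

for _ in range(500):
    # model feature expectations E_model[f], via the partition function Z
    scores = [math.exp(sum(wj * fj for wj, fj in zip(w, features(*s))))
              for s in states]
    Z = sum(scores)
    model = [sum((sc / Z) * features(*s)[j] for s, sc in zip(states, scores))
             for j in range(3)]
    # gradient ascent step on the average log-likelihood
    w = [wj + 0.5 * (e - m) for wj, e, m in zip(w, emp, model)]

print(w)  # all three weights approach log(2)/2, about 0.347
```

At convergence the model's feature expectations match the empirical ones, which is exactly the moment-matching condition for exponential-family MLE.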
What are the 2 main approaches for structure learning?
1) Constraint-based approach: employs independence tests to identify a set of edge constraints for the graph and then finds the best DAG that satisfies the constraints. It works well when some prior (expert) knowledge of the structure is available, but it requires many data samples to guarantee testing power, so it is less reliable when the number of samples is small. For example, we can distinguish a V-structure from a fork structure by testing the independence of the two outer variables conditional on the middle variable.
2) Score-based approach: defines a criterion to evaluate how well the Bayesian network fits the data, then searches over the space of DAGs for a structure achieving the maximal score. It is essentially a search problem that consists of two parts: the definition of a score function and the search algorithm.
Score-based methods view a Bayesian network as specifying a ____________ model and then address learning as a model ______________ problem.
statistical; selection
Score-based methods approach the problem of structure learning as an optimization problem. True or False?
True
______ score is an asymptotic approximation of the Bayesian score.
BIC
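A small sketch of the BIC score (the numbers below are made up for illustration): BIC is the log-likelihood at the MLE minus a complexity penalty of (d/2) log N, where d is the number of free parameters and N the number of samples, and it approximates the Bayesian score as N grows.

```python
# BIC score for a candidate structure: fit minus complexity penalty.
import math

def bic(loglik_mle, num_params, num_samples):
    return loglik_mle - 0.5 * num_params * math.log(num_samples)

# hypothetical scores: a denser graph fits slightly better but pays a
# larger penalty, so the sparser structure wins here
print(bic(-1200.0, 10, 500))  # sparser structure
print(bic(-1195.0, 25, 500))  # denser structure scores lower after penalty
```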
In structure learning, the likelihood function, which we used for parameter estimation, measures the probability of the _______ given a _________.
It seems intuitive to find a model that would make the ________ as probable as possible. Our goal is to find a graph and parameters that maximize the likelihood.
To find the maximum likelihood pair (structure, parameters), we should find the graph structure G that achieves the highest likelihood when we use the ______ parameters for G.
data; model; data; MLE