Def: First-order Markov models vs higher-order (e.g. second-order) Markov models
A first-order Markov chain of observations {x_t}, in which the distribution of a particular observation x_t is conditioned on the value of the previous observation x_{t-1}
A second-order Markov chain, in which the conditional distribution of a particular observation x_t depends on the values of the two previous observations x_{t-1} and x_{t-2}
What is the process of learning Markov models?
You just count the transitions observed in the training data and normalize the counts into transition probabilities
Markov Models: If you apply a 2nd-order Markov model to predict the next English word, there are many three-word sequences that never appear in the training data but do appear at test time
During training you need to reserve some probability mass for unseen cases, which is called ______.
smoothing
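A minimal sketch of the count-and-normalize idea with add-one (Laplace) smoothing; the toy corpus, state names, and the `transition_probs` helper are made up for illustration:

```python
from collections import Counter

def transition_probs(sequence, states, alpha=1.0):
    """Estimate first-order transition probabilities by counting
    consecutive pairs, with add-alpha (Laplace) smoothing so that
    transitions never seen in training still get nonzero probability."""
    counts = Counter(zip(sequence, sequence[1:]))
    probs = {}
    for s in states:
        total = sum(counts[(s, s2)] for s2 in states) + alpha * len(states)
        for s2 in states:
            probs[(s, s2)] = (counts[(s, s2)] + alpha) / total
    return probs

# Toy training sequence; 'sunny' -> 'snow' is never observed,
# but smoothing still assigns it a small probability.
seq = ["sunny", "sunny", "rain", "sunny", "rain", "rain", "snow"]
p = transition_probs(seq, ["sunny", "rain", "snow"])
```

With alpha=0 this reduces to the plain maximum-likelihood counts, and unseen transitions get probability zero.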
What are the following Markov model parameters?
Start/ Initial Probability
State transition probability
Emission probability
Start/ Initial (self-explanatory: the probability of starting in each hidden state)
State transition - the probability of switching from one hidden state to another
Emission - the probability of observing x in a given hidden state (e.g. for the dice room, the probability of rolling a 1 with the fair vs. the loaded die)
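The three parameter groups can be written down concretely; the numbers below are hypothetical values for the fair/loaded dice example, not from the source:

```python
import numpy as np

states = ["fair", "loaded"]

pi = np.array([0.5, 0.5])          # start/initial probability for each hidden state
A = np.array([[0.95, 0.05],        # transition: P(next state | current state)
              [0.10, 0.90]])
E = np.array([[1/6] * 6,           # emission: fair die is uniform over faces 1..6
              [0.1] * 5 + [0.5]])  # loaded die favors rolling a 6

# Each of pi, and each row of A and E, is a probability distribution.
```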
What is the key assumption of Hidden Markov Models?
The key independence property is that the hidden variables z_{t-1} and z_{t+1} are conditionally independent given z_t
In hidden Markov Models, what happens if hidden states are not given?
Unlike in (non-hidden) Markov models, observation x_t depends on all previous observations x_i (i < t) once the hidden states are marginalized out
What are the three main problems of HMMs (unsupervised)?
Decoding: find the most likely hidden state path z* = argmax_z p(x, z | θ)
Evaluation: compute the likelihood p(x | θ)
Learning: find the parameters θ that maximize p(x | θ)
HMM: What is the naive method of finding z* and what is the problem with this method?
The naïve method of finding z* is to compute p(x, z | θ) for every value that z can take
Problem: the number of possible paths z is exponential in the sequence length (K^n paths for K states)
What is the main idea of Decoding called, and what kind of programming is this?
“Bugs on a Grid”
Dynamic programming
What are the “Bugs on a Grid” steps for decoding?
What is the time complexity and space (big O) of the Viterbi Algorithm?
Time complexity: O((K^2)n) Space O(Kn)
Underflows are a significant problem of the Viterbi algorithm. What is the solution?
p(x1, …, xn, z1, …, zn) is a product of many probabilities, so these numbers become extremely small - underflow
Solution: Take the logs of all values
V_{k'}(t+1) = log e_{k', x_{t+1}} + max_k ( V_k(t) + log a_{k,k'} )
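A log-space Viterbi sketch of this recurrence; the `viterbi_log` helper and the fair/loaded dice parameters below are illustrative assumptions, not from the source:

```python
import numpy as np

def viterbi_log(x, pi, A, E):
    """Log-space Viterbi: V_k'(t+1) = log e_{k',x_{t+1}} + max_k (V_k(t) + log a_{k,k'}).
    Taking logs turns products into sums and avoids underflow.
    Time O(K^2 n), space O(K n)."""
    K, n = A.shape[0], len(x)
    logA, logE = np.log(A), np.log(E)
    V = np.zeros((K, n))
    back = np.zeros((K, n), dtype=int)
    V[:, 0] = np.log(pi) + logE[:, x[0]]
    for t in range(1, n):
        for k2 in range(K):
            scores = V[:, t - 1] + logA[:, k2]
            back[k2, t] = int(np.argmax(scores))
            V[k2, t] = logE[k2, x[t]] + scores[back[k2, t]]
    z = [int(np.argmax(V[:, n - 1]))]          # best final state
    for t in range(n - 1, 0, -1):              # trace the best path backwards
        z.append(int(back[z[-1], t]))
    return z[::-1]

# Hypothetical fair/loaded dice parameters; faces are 0-indexed (face 5 = "six")
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
path = viterbi_log([5] * 10, pi, A, E)         # ten sixes in a row
```

On ten consecutive sixes the best path stays in the loaded state (index 1) throughout, while a mixed run of faces is best explained by the fair die.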
What are the steps that are different when doing the “Bugs on a Grid” algorithm for Evaluation? What is another name for this algorithm?
Steps 3 and 5 change: the max over previous states is replaced by a sum, so we sum probabilities over all paths instead of keeping only the best one
AKA “forward algorithm”
What is the running time, and space required, for Forward and Backward algorithms?
Time complexity: O((K^2)n) Space O(Kn)
What does the posterior decoding calculation compute?
The most likely state at time t given sequence x:
argmax_k p(z_t = k | x)
(the forward and backward algorithms allow us to calculate this)
p(z_t = k | x) = f_k(t) b_k(t) / p(x)
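A sketch of posterior decoding via the forward and backward passes, kept in plain probability space for readability; the `posterior` helper and the dice parameters are illustrative assumptions:

```python
import numpy as np

def posterior(x, pi, A, E):
    """Posterior decoding: p(z_t = k | x) = f_k(t) b_k(t) / p(x).
    Plain probability space for clarity; a real implementation would
    rescale each column (or work in logs) to avoid underflow."""
    K, n = A.shape[0], len(x)
    f = np.zeros((K, n))
    b = np.zeros((K, n))
    f[:, 0] = pi * E[:, x[0]]
    for t in range(1, n):                      # forward pass
        f[:, t] = E[:, x[t]] * (A.T @ f[:, t - 1])
    b[:, n - 1] = 1.0
    for t in range(n - 2, -1, -1):             # backward pass
        b[:, t] = A @ (E[:, x[t + 1]] * b[:, t + 1])
    px = f[:, n - 1].sum()                     # p(x) = sum_k f_k(n)
    return f * b / px                          # column t is the posterior at time t

# Hypothetical fair/loaded dice parameters, faces 0-indexed
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.10, 0.90]])
E = np.array([[1/6] * 6, [0.1] * 5 + [0.5]])
post = posterior([5, 5, 5, 0, 1], pi, A, E)    # each column sums to 1
```

Taking argmax over each column of `post` gives the posterior-decoded state at every time step.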
Problem 3 of HMM Learning requires finding the parameters θ that maximize p(x|θ). How can we estimate these parameters when the hidden state path is unknown?
Use the EM algorithm
How do we do the EM algorithm efficiently?
Use the “Bugs on a Grid” dynamic programming trick
- Initialize HMM parameters θ =(pi,A,E)
- given a training sequence x = {x1, x2, …, xn}
E STEP: Compute useful distribution about hidden states that will be used to estimate model parameters later in the M step
1. Compute p(z_t = k | x) for all k = 1, …, K (calculated using the forward-backward algorithm)
2. Compute p(z_t = k, z_{t+1} = k' | x) = f_k(t) a_{k,k'} e_{k', x_{t+1}} b_{k'}(t+1) / p(x)
M STEP: Compute max likelihood estimates for model parameters, given hidden state distribution is known
1. Estimate initial probability
2. Estimate Transition probability
3. Estimate Emission Probability
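A sketch of the M-step, assuming the E-step already produced the two posteriors above as arrays `gamma[k, t] = p(z_t = k | x)` and `xi[k, k', t] = p(z_t = k, z_{t+1} = k' | x)`; the `m_step` helper and the array shapes are illustrative assumptions:

```python
import numpy as np

def m_step(gamma, xi, x, n_symbols):
    """M-step of EM for an HMM: each parameter is an expected count
    under the hidden-state posterior, normalized into a distribution.
    gamma has shape (K, n); xi has shape (K, K, n-1)."""
    pi = gamma[:, 0]                             # 1. initial probability
    A = xi.sum(axis=2)                           # 2. expected transition counts
    A /= A.sum(axis=1, keepdims=True)            #    normalize each row
    K = gamma.shape[0]
    E = np.zeros((K, n_symbols))
    for t, sym in enumerate(x):                  # 3. expected emission counts
        E[:, sym] += gamma[:, t]
    E /= E.sum(axis=1, keepdims=True)            #    normalize each row
    return pi, A, E
```

Alternating this M-step with the forward-backward E-step and iterating to convergence is the EM procedure for HMMs.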
Define: Deep Learning
A set of machine learning algorithms that model high-level abstractions in data using artificial neural networks
Why deep learning now?
Computers finally have enough computational power
Neural net algorithms are inspired by ___
the brain
What does a Linear Neuron calculate?
A linear neuron just computes a weighted sum of input features
What is a multi-layer network of linear neurons?
Still a linear model
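A quick numerical check of why stacking linear neurons stays linear: two stacked weight matrices collapse into a single linear map, so depth adds no expressive power without a nonlinearity in between (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first linear layer: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # second linear layer: 4 units -> 2 outputs
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # pass x through both layers
one_layer = (W2 @ W1) @ x          # equivalent single linear layer
assert np.allclose(two_layers, one_layer)
```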
What does a Binary Threshold Neuron do?
- first computes a weighted sum of its inputs
- then sends out a fixed-size spike of activity if the weighted sum exceeds a threshold
What does a Rectified Linear Neuron calculate?
They compute a weighted sum of their inputs, then pass it through a rectified linear unit: output = max(0, z). (Sometimes called linear threshold neurons.)
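Minimal sketches of the two neuron types from the cards above (the helper names are hypothetical):

```python
import numpy as np

def relu_neuron(w, x, b=0.0):
    """Rectified linear neuron: weighted sum z, then output max(0, z)."""
    z = np.dot(w, x) + b
    return max(0.0, z)

def binary_threshold_neuron(w, x, threshold=0.0):
    """Binary threshold neuron: a fixed-size spike (1) only if the
    weighted sum exceeds the threshold, otherwise nothing (0)."""
    return 1 if np.dot(w, x) > threshold else 0
```

The only difference is the output nonlinearity: the threshold unit emits an all-or-nothing spike, while the rectified unit passes positive sums through unchanged.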