Block 4: Unsupervised Learning Flashcards

Question 1

Q

What is the flaw of PCA?

Answer

A

PCA looks for components with largest variance (by eigendecomposition) but sometimes large variance is not the most interesting, we could be more interested in clusters.

Question 2

Q

How does entropy measure “interestingness” with multimodal distribution (unlike Gaussian)?

Answer

A

Kullback-Leibler Divergence: DKL (g||f) = ∫g(x) log{g(x)/f(x) dx ( =0 for perfect g(x) = f(x))
Entropy: H(g) = - ∫g(x) log{g(x)} dx

Compare entropy of densities with same variance σ^2 and mean 0.
Minimised by optimal g(.) and maximised by Gaussian (= most boring).

Question 3

Q

Explain sphering

Answer

A

Aim is to make the variance matrix the identity I

take R = S^(1/2) where S is sample variance S = 1/n X^T X
sphering step: W = XR

Question 4

Q

Explain Exploratory Projection Pursuit (computational technique)

Answer

A

Center and sphere matrix X to get W
Set projected matrix U_a = Wa with unit vector a
Form density estimate f^u,a (using block 3)
Find minimiser a of entropy H{f^u,a}: initial random unit a and optimise (repeat because initial a can impact a lot the “optimal” result)

Question 5

Q

Explain Independent Component Analysis

Answer

A

Using entropy minimisation: make columns of S statistically independent and non-Gaussian:

whitening (sphere X) for identity covariance
Write S = XA for an A orthogonal
want to minimise mutual information I(Y) = sum(j=1 to p) {H(Yj) - H(X)} over A for Y = A^TX and therefore minimise sum(j=1 to p) H(Yj)

Question 6

Q

Explain flaw of eigendecomposition for ICA

Answer

A

Using eigendecomposition:
writing X = SA^T where S = √n U and A^T = DV^T /√n from X = UDV^T where vectors of S are uncorrelated but decomposition is not unique (S can be rotated)!

Question 7

Q

Express the Projection Pursuit Regression

Answer

A

f(x) = sum(m=1 to M) gm( wm^T X)

initialise data directions {w}
find {g} good smooth functions using smoothing splines
find {w} by minimising sum of squres error by weighted least squares
repeat until convergence and repeat for {wm} and {gm} if M > 1

Question 8

Q

Explain structure of neural networks

Answer

A

initialisation activation sigmoid Zm = σ(X) = σ (α0 + α^T X) ( σ not linear)
one hidden layer Z
output function to get target: Y= β0 + β^T Z (linear)

Question 9

Q

Compare PPR and ANN

Answer

A

PPR uses non-parametric {gm} whereas sigmoid {σ} are less complex but use parameters {αk} and {βk} called weights
ANN also minimising sum of squares using gradient descent alternating back propagation to update parameters (using gradient of model and learning rate) and forward propagation to predict and compute sum of squares errors with current parameters

Question 10

Q

What are the disadvantages of ANN?

Answer

A

tend to overfit with too many parameters
solutions of very sensible to input
hard to know number of hidden layers

Block 4: Unsupervised Learning Flashcards

(10 cards)