How does single linkage clustering work? (13.4)
Define the distance between two clusters as the distance between their two closest points, one from each cluster (so if two points are already in the same cluster, the intercluster distance is 0).
Start with every point in its own cluster, repeatedly merge the two closest clusters, and stop once you are down to k clusters.
You end up creating a tree structure with the root being everything in a single cluster.
How you define distance is one way of inserting domain knowledge.
This typically leads to chain-like structures.
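A minimal sketch of the idea in Python (my own illustration, not the lecture's code; it assumes Euclidean distance and a target number of clusters k):

    # Minimal single linkage clustering sketch: start with every point in its own
    # cluster and repeatedly merge the two clusters whose closest pair of points
    # is nearest, until only k clusters remain.
    import numpy as np

    def single_linkage(points, k):
        points = np.asarray(points, dtype=float)
        clusters = [[i] for i in range(len(points))]   # start: one cluster per point

        def cluster_dist(a, b):
            # single linkage: distance between the closest pair of points, one from each cluster
            return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

        while len(clusters) > k:
            # find the pair of clusters with the smallest single-link distance ...
            i, j = min(
                ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
            )
            # ... and merge them
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return clusters

    # Toy usage: two obvious groups on a line
    print(single_linkage([[0.0], [0.1], [0.2], [5.0], [5.1]], k=2))
    # -> [[0, 1, 2], [3, 4]]

The naive closest-pair search in each merge step is also where the O(n^3) running time discussed below comes from.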
When might median be a good distance metric in single linkage clustering? (13.4)
When the value of the distance itself doesn’t really matter, but the order is what is most important.
What is a metric vs. non-metric statistic? (13.4)
Metric: the specific values matter (e.g., the average, where the actual numbers make a difference)
Non-metric: the specific values don't matter, only the rank order does (e.g., the median)
What’s the running time of Single Linkage Clustering? (13.5)
O(n^3) because:
you repeat the merge step up to n times (once per merge), and
each merge step has to find the closest pair of clusters, which means looking at the distance between every pair of points: O(n^2).
So roughly n merges × O(n^2) work per merge = O(n^3).
How does K-Means clustering work? (13.7)
Pick k initial cluster centers (e.g., at random). Then repeat: assign each point to its nearest center, and move each center to the mean of the points assigned to it. Stop when the assignments no longer change.
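A minimal sketch (my own code, assuming Euclidean distance and random initial centers drawn from the data):

    # Minimal k-means sketch: alternate between assigning points to their nearest
    # center and moving each center to the mean of its assigned points.
    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers

        for _ in range(iters):
            # assignment step: P(x) = argmin_i ||x - center_i||^2
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # update step: center_i = mean of the points assigned to cluster i
            new_centers = np.array([
                X[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
                for i in range(k)
            ])
            if np.allclose(new_centers, centers):   # converged: nothing moved
                break
            centers = new_centers
        return centers, assign

    X = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
    centers, assign = kmeans(X, k=2)
    print(centers)   # roughly [[0, 0.5], [5, 5.5]] (cluster order depends on the seed)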
Why does K-means clustering converge? (13.11)
The error is monotonically non-increasing.
When updating the partition (assignments):
• P(x) = argmin_i || x - center_i ||^2
• You assign x to the cluster whose center is closest to it.
• A point only moves from one cluster to another if that squared distance (its contribution to the error) gets smaller, so the error can only stay the same or decrease.
When updating the center of each partition (without changing the partition):
• center_i = (1/|C_i|) * sum_{y in C_i} y
• The average minimizes the squared error (take the derivative of E and set it equal to 0; written out below).
For this to work you must have some way of breaking ties that is deterministic and consistent.
You will never revisit a configuration you’ve already looked at.
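The "take the derivative of E and set it equal to 0" step in the center update, written out (a standard least-squares argument, not copied from the lecture):

    E(c) = \sum_{y \in C_i} \| y - c \|^2
    \frac{\partial E}{\partial c} = -2 \sum_{y \in C_i} (y - c) = 0
    \;\Rightarrow\; c = \frac{1}{|C_i|} \sum_{y \in C_i} y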
What randomized optimization technique is most like K-means clustering? (13.10/11)
Hill Climbing - each time you set P(x) for all points and then readjust the centers, you either decrease the error or leave it unchanged. So always improvement or no improvement, but never a step that makes things worse.
What are the properties of k-means clustering? (13.12)
• Each iteration is polynomial: roughly O(kn) distance computations plus the center updates.
• There are only finitely many ways to assign n points to k clusters (at most k^n), so with consistent tie-breaking it terminates in finite time.
• The error is monotonically non-increasing.
• It can converge to a local optimum rather than the global one (see the next question).
Can k-means clustering get stuck and if so how could you avoid it? (13.12)
Yes, it can get stuck!
Example (from the lecture's picture): suppose points a, b, e, f sit close together and c, d sit close together elsewhere, and the initial cluster centers happen to land on b, c, and d. Then b's center claims a, b, e, and f, while c and d each keep a single point; no center ever moves to split up the big group, so the error stops decreasing at a local optimum.
It can be avoided via random restarts, by selecting initial points on the fringe, or by making sure the initial cluster centers are as far away from each other as possible.
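A small sketch of the "centers as far apart as possible" idea (often called farthest-first initialization; the code and names here are mine, not the lecture's):

    # Farthest-first initialization sketch: pick a random first center, then
    # repeatedly pick the point farthest from all centers chosen so far.
    import numpy as np

    def far_apart_centers(X, k, seed=0):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]          # first center: a random point
        while len(centers) < k:
            # distance from each point to its nearest chosen center
            d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
            centers.append(X[np.argmax(d)])          # next center: the farthest point
        return np.array(centers)

    print(far_apart_centers(np.array([[0, 0], [0, 1], [5, 5], [5, 6]]), k=2))
    # -> one point from each of the two far-apart groups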
What is soft clustering and how does it work? (13.14)
This is similar to Bayesian learning, where we are trying to find the hypothesis that maximizes the probability of the data.
Assume data was generated as follows:
• Select one of k Gaussians [with fixed variance] uniformly at random
• Sample x_i from that Gaussian
• Repeat n times
Goal: find a hypothesis h = {μ_1, …, μ_k} (the k means) that maximizes the probability of the data (maximum likelihood).
This means that each data point could probabilistically belong to more than one cluster.
What is the ML mean of the Gaussian if k = 1? (13.15)
If you know all the data is coming from a single Gaussian, then the ML mean is just the mean of the data.
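Written out (the standard maximum-likelihood result; notation is mine):

    \hat{\mu}_{ML} = \arg\max_{\mu} \prod_{i=1}^{n} P(x_i \mid \mu) = \frac{1}{n} \sum_{i=1}^{n} x_i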
What is expectation maximization (EM)? (13.16/17)
Moving back and forth between soft clustering and computing the means for the soft clustering.
This is conceptually much like k-means, but instead of a hard assignment you assign each data point a probability of belonging to each cluster.
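A minimal EM sketch for the generative story above (k one-dimensional Gaussians with fixed, known variance and uniform selection; the code and names are my own illustration, not the lecture's):

    # EM for a mixture of 1-D Gaussians with fixed variance and uniform weights.
    import numpy as np

    def em_fixed_variance(x, k, sigma=1.0, iters=50, seed=0):
        x = np.asarray(x, dtype=float)
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=k, replace=False)            # initial means

        for _ in range(iters):
            # E step: soft assignment z[i, j] = P(cluster j | x_i); uniform priors cancel
            log_lik = -((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2)
            lik = np.exp(log_lik)
            z = lik / lik.sum(axis=1, keepdims=True)
            # M step: each mean is the z-weighted average of all the data
            mu = (z * x[:, None]).sum(axis=0) / z.sum(axis=0)
        return mu, z

    x = np.array([0.0, 0.2, 0.1, 5.0, 5.2, 5.1])
    mu, z = em_fixed_variance(x, k=2)
    print(np.sort(mu))     # roughly [0.1, 5.1] for this well-separated data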
What are the properties of EM? (13.18)
• The likelihood of the data is monotonically non-decreasing.
• It never diverges, but unlike k-means it is not guaranteed to fully converge (the means move in a continuous space); in practice it gets close quickly.
• It can get stuck in local optima, so random restarts help.
• It works with any distribution, as long as the E and M steps are solvable.
What are the desirable properties of clustering algorithms? (13.19)
• Richness: for any desired grouping of the points, there is some distance matrix that makes the algorithm produce it.
• Scale invariance: multiplying all the distances by a positive constant doesn't change the clustering.
• Consistency: shrinking intra-cluster distances and expanding inter-cluster distances doesn't change the clustering.
Which clustering scheme can achieve all three of these desirable properties? (13.21)
NONE! Impossibility theorem (Kleinberg) says no clustering scheme can achieve all three:
• richness
• scale invariance
• consistency
Why do we bother with feature selection? (14.2)
• Knowledge discovery: it lets you focus on the features that actually matter (interpretability) and ignore those that don't
• Curse of dimensionality: the amount of data you need grows exponentially with the number of features, so fewer features means fewer samples required
What is filtering vs wrapping in feature selection? (14.4)
Filtering:
N Features -> [ Search Algorithm ] -> Fewer Features
• The criterion for how well you're doing is buried within the search algorithm itself; the learner gives no feedback.
• You can still use the labels in your filter (e.g., information gain as the criterion).
Wrapping:
N Features -> [ Search <-> Learning Alg. ] -> Fewer Features
• The search proposes a subset of features, the learning algorithm reports back how well it does with those features, and this goes back and forth until you settle on a subset.
What are the pros and cons of filtering and wrapping? (14.5)
Filtering:
+ Speed!
- The speed comes at a cost: you're looking at features in isolation and aren't considering feature dependence
- ignores the learning problem
Wrapping:
+ take into account model bias and learning!
- But it’s really really slow
What are some ways in which you could do filtering? (14.6)
The goal is to make the learning problem easier by reducing the number of features (which reduces the number of samples needed), up to the point where it makes the learning problem harder because you just don't have enough differentiating features left.
Example filtering criteria: information gain, variance / entropy, independence / non-redundancy between features; you can even train a decision tree and keep only the features it actually splits on.
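A tiny filtering sketch (my own code, assuming discrete features and labels): score each feature in isolation by information gain and keep the top m.

    # Information-gain filter: rank features independently of any learner.
    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(feature, labels):
        gain = entropy(labels)
        for v in np.unique(feature):
            mask = feature == v
            gain -= mask.mean() * entropy(labels[mask])
        return gain

    def filter_features(X, y, m):
        scores = [information_gain(X[:, j], y) for j in range(X.shape[1])]
        return np.argsort(scores)[::-1][:m]          # indices of the m best features

    # Toy usage: feature 0 determines the label, feature 1 is noise
    X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
    y = np.array([0, 0, 1, 1])
    print(filter_features(X, y, m=1))                # -> [0]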
What are some ways in which you could do wrapping? (14.6/7)
Any search over feature subsets will do: hill climbing or other randomized optimization (simulated annealing, genetic algorithms), forward search, or backward search.
How does forward search work? (14.7)
Start with no features. Add each remaining feature one at a time to the current subset, pass each candidate subset to the learner, and keep the single addition that scores best; repeat until adding another feature no longer improves the score.
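A minimal forward-search wrapper sketch (my own code; score(features) stands in for training the wrapped learner on just those features and returning, say, cross-validation accuracy):

    # Forward search: greedily grow the feature subset while the wrapped score improves.
    def forward_search(all_features, score):
        chosen, best_score = [], float("-inf")
        while True:
            candidates = [f for f in all_features if f not in chosen]
            if not candidates:
                return chosen
            # try adding each remaining feature on its own and keep the best
            best_f = max(candidates, key=lambda f: score(chosen + [f]))
            new_score = score(chosen + [best_f])
            if new_score <= best_score:              # stop once adding features no longer helps
                return chosen
            chosen, best_score = chosen + [best_f], new_score

    # Toy usage: pretend features 2 and 4 are the only useful ones
    useful = {2, 4}
    print(forward_search([1, 2, 3, 4, 5], score=lambda fs: len(useful & set(fs)) - 0.01 * len(fs)))
    # -> [2, 4]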
How does backward search work? (14.7)
Say you start with 5 (N) features: 1, 2, 3, 4, 5. In this case you’re trying to eliminate one feature at a time.
• Pass all combinations of N-1 to the learner and get result:
• 1, 2, 3, 4
• 1, 2, 4, 5
• 1, 3, 4, 5
• 2, 3, 4, 5
• Keep the best combination and then repeat until your score peaks and then goes down
This is like eliminating people during tryouts.
When is a feature strongly relevant? (14.9)
If removing that feature degrades the Bayes Optimal Classifier, then it is strongly relevant.
That is to say, if you are considering ALL features and you remove this one feature and the BOC degrades, it is strongly relevant.
You can make a feature stop being strongly relevant just by putting a copy of it in the feature set (14.11): once the duplicate exists, removing either copy alone no longer degrades the Bayes Optimal Classifier.
When is a feature weakly relevant? (14.9)
If it is NOT strongly relevant AND adding it to some other subset of features would improve the Bayes Optimal Classifier.
So if you have a subset of features and adding this feature improves the BOC, it is weakly relevant.
If a feature is not strongly relevant and it is not weakly relevant then it is irrelevant.