What is the goal of clustering?
When given a set of points with a notion of distance, group the points into a number of clusters so that members of a cluster are similar to each other and members of different clusters are dissimilar
Why is clustering a hard problem?
We are often working in very high dimensional space where almost all pairs of points are at about the same distance.
There is also a large amount of data in general
What are the 2 methods for clusteting?
What is the difference between agglomerative and divisive hierarchical clustering?
Agglomerative is bottom-up. Each point starts as a cluster and the two nearest clusters are repeatedly combined
Divisive is top-down. Each point starts in the same cluster. We split the cluster repeatedly until we get the desired number of clusters
What are the important questions associated with hierarchical clustering?
How does the Euclidean space model answer the questions about hierarchical clustering?
How does the non-Euclidean space model answer the questions about hierarchical clustering?
What is the difference between a centroid and clustroid?
A centroid is the average of data points in the cluster. It is an artificial point
A clustroid is an existing data point that is closest to all other points in the cluster
How do we determine which point becomes the clustroid?
What are the 2 approaches for determining how many clusters to create?
What are the 3 methods for determining cohesion of a cluster?
What is the issue with hierarchical clustering?
It is slow. It is O(n^3) for the naive approach and O(n^2 log n) for careful implementation
How does the k-means clustering work?
How do we select k for k-means clustering?
Look at average distance to centroid as k increases. It will fall rapidly until the correct k and then change little
What are the 2 approaches for picking the initial k points for clusters?
What is the BFR algorithm?
A variant of k-means clustering to handle very large data sets. Meant to reduce memory usage from O(data) to O(clusters)
What is the key assumption of the BFR algorithm?
Assumes clusters are normally distributed around a centroid in Euclidean space. Results in axis-aligned ellipses as clusters
What are the 3 classes of points in the BFR algorithm?
How is each cluster summarized using the BFR algorithm?
The discard set of the cluster is summarized by the number of points N, the vector SUM that contains the sum of coordinates in the ith position for all i, the vector SUMSQ that contains the sum of squares of coordinates in the ith position
How do we perform the BFR algorithm?
Read points into memory from disk one main-memory-full at a time
Points from previous memory loads are summarized by the sample statistics for the 3 point sets
How can we calculate the average in each dimension using BFR?
For ith dimension we can calculate SUMi / N
How can we calculate variance of a cluster’s discard set using BFR?
For ith dimension we calculate
(SUMSQi)/N) - (SUMi/N)^2
How do we decide whether to put a new point into a cluster in BFR?
If there is a high likelihood of the point belonging to the currently nearest centroid according to the normal distribution
How do we determine if 2 compression set subclusters should be combined?
Compute the variance of the proposed combined cluster. Combine if variance is below some threshold