4. Clustering Flashcards by Branden Wheeler

What is the goal of clustering?

Given a set of points and a notion of distance between them, group the points into some number of clusters so that members of the same cluster are similar to each other and members of different clusters are dissimilar

Why is clustering a hard problem?

We are often working in a very high-dimensional space, where almost all pairs of points are at about the same distance from each other.
The data sets are also typically very large
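
A minimal sketch of the first point, assuming points drawn uniformly at random from the unit cube (the function name `relative_spread` and the sample sizes are illustrative choices, not part of the original card). It measures how much the largest pairwise distance differs from the smallest, relative to the smallest:

```python
import math
import random

def relative_spread(dim, n_points=100, seed=0):
    """Return (max - min) / min over all pairwise Euclidean distances
    between n_points random points in the dim-dimensional unit cube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

low = relative_spread(dim=2)      # nearest and farthest pairs differ hugely
high = relative_spread(dim=1000)  # all pairs are at nearly the same distance
print(low, high)
```

In two dimensions the ratio is large because some pairs are almost touching; in a thousand dimensions it shrinks well below 1, which is why "nearest" carries less information there.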

What are the 2 methods for clustering?

Hierarchical (agglomerative and divisive)
Point assignment where points belong to their nearest cluster

What is the difference between agglomerative and divisive hierarchical clustering?

Agglomerative is bottom-up. Each point starts as a cluster and the two nearest clusters are repeatedly combined
Divisive is top-down. All points start in a single cluster. We split clusters repeatedly until we get the desired number of clusters

What are the important questions associated with hierarchical clustering?

How do you represent a cluster of more than 1 point?
How do you determine the nearness of clusters?
When do you stop combining clusters?

How does the Euclidean space model answer the questions about hierarchical clustering?

Each cluster is represented by its centroid
Cluster distances are measured by distance between centroids
Combine until we get the desired number of clusters
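
The three answers above can be sketched as a naive agglomerative loop. This is an illustration under simple assumptions (points are tuples of numbers, `k` is at least 1, and the function names are made up for the example), not an optimized implementation:

```python
import math

def centroid(cluster):
    """Average of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def agglomerative(points, k):
    """Start with one cluster per point and repeatedly merge the two
    clusters whose centroids are nearest, until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = agglomerative(points, k=2)
print(clusters)
```

Every merge step rescans all pairs of clusters, so the cost of this naive version grows roughly with the cube of the number of points.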

How does the non-Euclidean space model answer the questions about hierarchical clustering?

Each cluster is represented by a clustroid, which is the data point closest to the other points in the cluster
Cluster distances are measured by distance between clustroids
Combine until we get the desired number of clusters

What is the difference between a centroid and clustroid?

A centroid is the average of data points in the cluster. It is an artificial point
A clustroid is an existing data point that is closest to all other points in the cluster

How do we determine which point becomes the clustroid?

Smallest maximum distance to other points
Smallest average distance to other points
Smallest sum of squares of distances to other points
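
As a rough sketch, each of the three criteria above picks the cluster member with the lowest score. The helper names `clustroid` and `hamming` and the sample strings are assumptions made for the example; Hamming distance stands in for any non-Euclidean distance:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def clustroid(cluster, distance, criterion="sum_sq"):
    """Return the member of `cluster` that minimizes the chosen criterion."""
    def score(candidate):
        # The candidate's distance to itself is 0 and is the same for every
        # candidate, so including it does not change which point wins.
        dists = [distance(candidate, other) for other in cluster]
        if criterion == "max":
            return max(dists)
        if criterion == "avg":
            return sum(dists) / len(dists)
        if criterion == "sum_sq":
            return sum(d * d for d in dists)
        raise ValueError(f"unknown criterion: {criterion}")
    return min(cluster, key=score)

words = ["aaaa", "aaab", "aabb", "abbb", "bbbb"]
for criterion in ("max", "avg", "sum_sq"):
    print(criterion, clustroid(words, hamming, criterion))
```

On this small example all three criteria agree on "aabb"; on real data they can pick different points.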

What are the 2 approaches for determining how many clusters to create?

Pick a number k up front and stop when you have k clusters
Stop when the next merge would create a cluster with low cohesion

What are the 3 methods for determining cohesion of a cluster?

Diameter of the merged cluster, which is the maximum distance between any two points in the cluster
Radius, which is the maximum distance of a point from the centroid or clustroid
Density-based approach: take the diameter or average distance and divide by the number of points in the cluster
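
The three measures might be computed as follows for points in Euclidean space. This is a sketch: the function names are illustrative, the radius is measured from the centroid, and the density-style score uses the diameter rather than the average distance:

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest distance between any two points in the cluster."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def radius(cluster):
    """Largest distance from any point to the cluster's centroid."""
    n = len(cluster)
    center = tuple(sum(c) / n for c in zip(*cluster))
    return max(math.dist(p, center) for p in cluster)

def density_score(cluster):
    """Diameter divided by the number of points in the cluster."""
    return diameter(cluster) / len(cluster)

square = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(diameter(square), radius(square), density_score(square))
```

A merge would be rejected when the chosen measure for the merged cluster crosses a threshold that the user has to pick.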

What is the issue with hierarchical clustering?

It is slow. It is O(n^3) for the naive approach and O(n^2 log n) for careful implementation

How does the k-means clustering work?

Start with k initial centroids and place each point in the cluster whose centroid is nearest
Update the location of each centroid to the average of the points assigned to it
Reassign all points to their closest centroid
Repeat until points no longer move between clusters and the centroids stabilize
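
A minimal sketch of this loop, assuming the initial centroids are supplied by the caller and that an empty cluster simply keeps its previous centroid (both are choices made for the example):

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing centroids, stopping once the assignment stops changing."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so the centroids are stable too
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids, assignment)
```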

How do we select k for k-means clustering?

Look at average distance to centroid as k increases. It will fall rapidly until the correct k and then change little

What are the 2 approaches for picking the initial k points for clusters?

Sampling: Use hierarchical clustering to get k clusters, pick a point from each, and use them as centroids
Dispersed set: Pick a random point, then pick k-1 more points, each as far as possible from the points already selected
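
The dispersed-set approach can be sketched as a farthest-first selection. The function name and the `seed` parameter are assumptions for the example; "as far as possible" is interpreted here as maximizing the distance to the nearest already-selected point:

```python
import math
import random

def dispersed_initial_points(points, k, seed=0):
    """Pick one random point, then repeatedly add the point whose
    distance to its nearest already-chosen point is largest."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in chosen))
        chosen.append(farthest)
    return chosen

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = dispersed_initial_points(points, k=2)
print(initial)
```

With two well-separated groups, the second pick always lands in the group that the random first pick did not come from.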

What is the BFR algorithm?

A variant of k-means clustering to handle very large data sets. Meant to reduce memory usage from O(data) to O(clusters)

What is the key assumption of the BFR algorithm?

Assumes clusters are normally distributed around a centroid in Euclidean space. Results in axis-aligned ellipses as clusters

What are the 3 classes of points in the BFR algorithm?

Discard set: Points close enough to a centroid to be summarized
Compression set: Points that are close together but not close to a centroid. They are summarized but not added to a cluster
Retained set: Isolated points waiting to be assigned to a compression set

How is each cluster summarized using the BFR algorithm?

The discard set of the cluster is summarized by three values: the number of points N; the vector SUM, whose ith component is the sum of the points' coordinates in the ith dimension; and the vector SUMSQ, whose ith component is the sum of the squares of the coordinates in the ith dimension

How do we perform the BFR algorithm?

Read points into memory from disk one main-memory-full at a time
Points from previous memory loads are summarized by the sample statistics for the 3 point sets

How can we calculate the average in each dimension using BFR?

For the ith dimension, the average is SUMi / N

How can we calculate variance of a cluster’s discard set using BFR?

For the ith dimension we calculate
(SUMSQi / N) - (SUMi / N)^2
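
A small sketch of how the N, SUM, and SUMSQ statistics yield the average and variance without keeping the points themselves. The class and method names are made up for the illustration:

```python
class ClusterSummary:
    """N, SUM, and SUMSQ statistics for one cluster's discard set."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        """Fold a point into the summary; the point can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def mean(self):
        """SUMi / N in each dimension."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """(SUMSQi / N) - (SUMi / N)^2 in each dimension."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

summary = ClusterSummary(dims=2)
for point in [(1, 2), (3, 4), (5, 6)]:
    summary.add(point)
print(summary.mean())      # [3.0, 4.0]
print(summary.variance())  # about 2.67 in each dimension
```

The memory used depends only on the number of dimensions, not on how many points have been added.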

How do we decide whether to put a new point into a cluster in BFR?

If there is a high likelihood of the point belonging to the currently nearest centroid according to the normal distribution
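
One common way to make "high likelihood" concrete for axis-aligned normal clusters is the Mahalanobis distance: the distance to the centroid with each dimension scaled by that dimension's standard deviation. The sketch below assumes this measure; the threshold of 2 * sqrt(d) is an illustrative choice, not something the card specifies:

```python
import math

def mahalanobis(point, mean, variance):
    """Distance from point to mean, with each dimension normalized by
    its standard deviation (valid for axis-aligned clusters)."""
    return math.sqrt(sum((x - m) ** 2 / v
                         for x, m, v in zip(point, mean, variance)))

def belongs(point, mean, variance, threshold):
    """Accept the point if its normalized distance is under the threshold."""
    return mahalanobis(point, mean, variance) < threshold

mean, variance = (0.0, 0.0), (1.0, 4.0)
threshold = 2 * math.sqrt(len(mean))  # hypothetical cutoff, about 2.83

print(mahalanobis((1, 2), mean, variance), belongs((1, 2), mean, variance, threshold))
print(mahalanobis((3, 8), mean, variance), belongs((3, 8), mean, variance, threshold))
```

The mean and variance used here are exactly the per-dimension values that the N, SUM, and SUMSQ summary provides.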

How do we determine if 2 compression set subclusters should be combined?

Compute the variance of the proposed combined cluster. Combine if variance is below some threshold
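
Because N, SUM, and SUMSQ are all simple sums, the summary of a combined subcluster is the component-wise sum of the two summaries. A sketch of the check, assuming summaries are stored as (N, SUM, SUMSQ) tuples and that "below some threshold" means every dimension's variance is under a user-chosen `max_variance`:

```python
def merge_summaries(a, b):
    """Combine two (N, SUM, SUMSQ) summaries by adding them component-wise."""
    n = a[0] + b[0]
    total = [x + y for x, y in zip(a[1], b[1])]
    total_sq = [x + y for x, y in zip(a[2], b[2])]
    return n, total, total_sq

def variance(summary):
    """Per-dimension variance: (SUMSQi / N) - (SUMi / N)^2."""
    n, total, total_sq = summary
    return [sq / n - (s / n) ** 2 for s, sq in zip(total, total_sq)]

def should_combine(a, b, max_variance):
    """Combine only if the merged subcluster stays tight in every dimension."""
    return all(v < max_variance for v in variance(merge_summaries(a, b)))

near_a = (2, [3.0], [5.0])     # summarizes the 1-D points 1 and 2
near_b = (2, [5.0], [13.0])    # summarizes the 1-D points 2 and 3
far_c = (2, [41.0], [841.0])   # summarizes the 1-D points 20 and 21

print(should_combine(near_a, near_b, max_variance=1.0))
print(should_combine(near_a, far_c, max_variance=1.0))
```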

What is the issue with the BFR algorithm?

Its normal-distribution assumption means clusters must be axis-aligned ellipses; it cannot handle clusters that are oriented at an angle to the axes

What is the CURE algorithm?

Clustering Using REpresentatives. Assumes Euclidean distance, allows clusters to be any shape, and uses a collection of representative points to represent each cluster

What happens during the 1st pass over the data of the CURE algorithm?

1. Pick a random sample of points that fits in main memory
2. Cluster the sampled points hierarchically
3. Pick representative points for each cluster, as dispersed as possible: choose points that are far from each other, then move each one some distance toward the cluster's centroid

What happens during the 2nd pass over the data of the CURE algorithm?

1. Rescan the dataset
2. Place each point in the closest cluster by finding the closest representative
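
A rough sketch of the representative-point idea behind both passes, assuming the initial clusters are already available. The function names, the farthest-first way of picking dispersed points, and the 20% shrink factor are assumptions for the example:

```python
import math

def representatives(cluster, count, shrink=0.2):
    """Pick `count` dispersed points from the cluster, then move each
    a fraction `shrink` of the way toward the cluster's centroid."""
    n = len(cluster)
    center = tuple(sum(c) / n for c in zip(*cluster))
    # Start from the point farthest from the centroid, then keep adding
    # the point farthest from everything chosen so far.
    chosen = [max(cluster, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(count, n):
        chosen.append(max(cluster,
                          key=lambda p: min(math.dist(p, c) for c in chosen)))
    return [tuple(x + shrink * (m - x) for x, m in zip(p, center))
            for p in chosen]

def assign(point, reps_by_cluster):
    """Return the label of the cluster owning the representative nearest to point."""
    return min(reps_by_cluster,
               key=lambda label: min(math.dist(point, r)
                                     for r in reps_by_cluster[label]))

reps = {
    "a": representatives([(0, 0), (4, 0), (0, 4), (4, 4)], count=2),
    "b": representatives([(10, 10), (12, 10), (10, 12), (12, 12)], count=2),
}
print(reps["a"])
print(assign((5, 5), reps), assign((9, 9), reps))
```

Because a cluster is described by several points rather than one centroid, the same assignment rule works for elongated or curved clusters.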

(28 cards)