What are some applications of clustering?
Image segmentation
Social network analysis
Bioinformatics
What is clustering?
The process of grouping a set of instances into classes of similar instances
What are the two types of clustering algorithms?
Partitional algorithms
Hierarchical algorithms
What are some considerations we need to make for clustering?
How to measure similarity between two things
What approach to take
How many clusters
What is a centroid?
A point that is considered to be the centre of a cluster
What are the steps for the k-means algorithm?
Can the final result of k-means clustering be affected by the initial centroid choice?
Yes - some can result in a poor convergence rate, or convergence to sub optimal clusterings
What is the time complexity of k-means clustering?
O(iknd)
i - iterations
k - num of centroids
n - num of data points
d - number of features of the data points
What are the limitations of k-means clustering?
How do we measure cluster validity?
What are the two kinds of hierarchical clustering algorithms?
How does HAC work?
How does HDC work?
What are the steps for HAC?
Compute the distance matrix (= distance between any 2 data points)
Let each data point be a cluster
Repeat:
Merge the two (or more) closest clusters
Update the distance matrix
Until only a single cluster remains
What are the different ways to define closest clusters?
What is the time complexity of all HAC methods without utilising a heap?
O(dn^2)
d - number of features per data point
n - num of individual data points
What is the time complexity of all HAC methods, utilising a heap?
O(dn^2 log(n) )
d - number of features per data point
n - num of individual data points