What is a Centroid
The mean vector of all items assigned to a given cluster (i.e. the mean of their feature vectors)
What are partitioning methods (Clustering)
Clustering methods that partition a dataset into a predefined number of clusters
Steps of Lloyd’s Algorithm for k-means clustering
What are the limitations of k-Means, in particular Lloyd’s algorithm
Advantages of k-Means
What is a common strategy for tackling cluster initialisation in k-Means
run the algorithm multiple times, select the solution(s) that scores well according to some validation measure
Explain how the centroids are initialised in the k-means++
What is the intuition for k-means++ initialisation
The cluster centroids are initialised by favouring points that are farther away from existing centroids, resulting in the centroids being spread out across the dataset.
What is cluster validation
Measures for automatically producing a quantitative evaluation of the quality of a clustering
What is the Silhouette Measure
Validation measure which quantifies degree to which each item belongs in its assigned cluster, relative to the other clusters
How to calculate the silhouette width
a = average distance to all other items in same cluster
b = average distance to all other items in nearest competing cluster
silhouette width = (b-a) / max(a, b)
How to compute the average silhouette width (ASW)
calculate overall score for a clustering by averaging the silhouette widths for all items
Brief description of Agglomerative Hierarchical Clustering
Begin with each item assigned to its own cluster. Apply a bottom-up strategy where, at each step, the most similar pair of clusters are merged
Brief description of Divisive Hierarchical Clustering
Begin with a single cluster containing all items. Apply a top-down strategy where, at each step, a chosen cluster is split into two sub-clusters
AGNES algorithm
Inputs: Distance matrix, Cluster Metric
1. Assign every item to its own cluster (leaf nodes)
2. Find the closest pair of clusters and merge them
3. Compute distances between the new cluster and each of the remaining cluster
4. Repeat from step 2 until there is one cluster (root node)
Define the three Cluster Metrics (Linkage)
Single Linkage: define cluster distance as the smallest pairwise distance between items from each cluster
Complete Linkage: define cluster distance as the largest pairwise distance between items from each cluster
Average Linkage: define cluster distance as the average of all pairwise distances between items from each cluster
Divisive Algorithms Template
Advantages of Hierarchical Clustering
Disadvantages of Hierarchical Clustering
What constitutes high quality clusters
high intra-cluster similarity and low inter-cluster similarity
in K-means, the global optimum may be found using what techniques?
What is a Medoid
A point in the cluster, whose dissimilarities with all the other points in the cluster are minimum
what is the basic idea of Partitioning Around Medoids (PAM)
start from an initial set of medoids and iteratively replace a medoid by a non-medoid if it reduces the total distance of the resulting clustering