Clustering
Clustering as search problem
Assessment of clustering
Purity
*external evaluation
*ratio between the number of elements of the
dominant class in a cluster and the cardinality of the cluster
RandIndex
Clustering algorithms
*Partitioning algorithms. Start with initial partition and refines it according to a criterion
-heuristics: K-means clustering
-model-based
clustering
*Hierarchical algorithms
-Bottom-up or agglomerative
-Top-down or divisive
K-means
*For each cluster we minimize the average of the distance between the
examples and the center of the cluster (centroid)
1-Generate K points, tht initial centroids
2-Instances are assigned to clusters based on the similarity/distance of
the examples from the centroids of the current clusters
3- Recalculate the position of the centroids as means of all the examples belonging to the
respective clusters
4 Repeat steps 2 and 3 until the centroids stabilize.
*examples are assumed to be real-valued vectors
Hierarchical algorithms
Agglomerative Hierarchical Clustering
*Starts from a cluster for each example and merges them gradually until a single cluster is obtained
*the history of merges forms a dendrogram
*How to measure of distance between clusters:
-Single-link: Similarity among the most similar examples in clusters, O(n^2), chaining effect
-Complete-link: Similarity among the most distant examples in the
clusters, O(n^2log2), sensitive to outliers
-Centroid: Similarity between the centroids of the clusters, O(n^2log2), non monotonic
-Average-link: Mean similarity between pairs of examples of clusters, O(n^2log2)
Dendrogram