Clustering (unsupervised)
Application of clustering
Objective of clustering
Divide the objects in the data into natural groups (clusters) such that:
- Objects in the same cluster are similar.
- Objects in different clusters are dissimilar.
Unsupervised: there are no class labels; the examples themselves determine how the data can be grouped together.
Cluster analysis
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The distances between data points within a cluster are minimized and the distances between data points of different clusters are maximized.
- The goal is to group similar data together, but what does "similar" mean?
  + The similarity measure is often more important than the clustering algorithm used.
  + Define a distance function as in k-nearest neighbors, but in clustering no class label is available to indicate which attributes are discriminating.
Distance
- For numerical values:
  + Euclidean distance
  + Manhattan distance
- For nominal/binary attributes:
  + Proportion of unequal attributes out of the total number of attributes.
When you have mixed attributes, you need to use weights, because you don’t want one attribute to dominate everything.
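The distance measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the default weights `w_num` and `w_nom` are assumptions made for the example, and the weighted sum is only one possible way to combine numerical and nominal attributes.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mismatch(a, b):
    """Proportion of nominal/binary attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mixed_distance(num_a, num_b, nom_a, nom_b, w_num=0.5, w_nom=0.5):
    """Weighted combination of the numeric and nominal parts.

    The weights are assumed values; in practice they are chosen so that
    no single attribute type dominates the total distance.
    """
    return w_num * euclidean(num_a, num_b) + w_nom * mismatch(nom_a, nom_b)
```

For example, `mixed_distance((0, 0), (3, 4), ("a", "b"), ("a", "c"))` combines a Euclidean distance of 5 with a mismatch proportion of 0.5. Because the numeric part is on a much larger scale, it would dominate unless the numeric attributes are rescaled or given a smaller weight.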
Types of clustering methods
Partitioning - Combinatorial
The distances (combinatorial)
K-means clustering
Because minimizing W exactly takes a large amount of time, k-means clustering was developed. It gives a good approximation: not exact, but fast.
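Assuming W denotes the total within-cluster distance (the usual "within-cluster point scatter"; the notation below is an assumption, since W is not written out here), it can be expressed as:

```latex
W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})
```

Here C(i) is the cluster assigned to object i, K is the number of clusters, and d is the chosen distance function. Minimizing W exactly means comparing all possible assignments of the objects to K clusters, and the number of such assignments grows far too quickly to enumerate for realistic data sets.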
Steps:
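A minimal sketch of the usual k-means iteration (often called Lloyd's algorithm), assuming Euclidean distance and initial centroids drawn at random from the data. The function name `kmeans` and the parameters `max_iter` and `seed` are illustrative choices, not taken from these notes.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k data points as initial centroids (an assumed initialization).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop as soon as the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids
```

Calling `labels, centroids = kmeans(points, 2)` returns one cluster index per point plus the final centroids. The result depends on the initial centroids, so in practice the algorithm is often run several times with different seeds and the run with the smallest within-cluster distance is kept.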
How many clusters are needed/wanted?
- Decide yourself
Selecting the right K
Hierarchical clustering
- Agglomerative -> bottom-up
- Divisive -> top-down
Hierarchical clustering - Bottom-up
Hierarchical clustering - cluster distance
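Bottom-up clustering needs a distance between clusters, not just between points. The sketch below assumes three common choices (single, complete, and average linkage), which are not named in these notes, and uses them in a naive agglomerative loop. All function names are illustrative, and Euclidean distance between points is assumed.

```python
import math

def single_link(a, b):
    """Distance between the two closest points of clusters a and b."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """Distance between the two farthest points of clusters a and b."""
    return max(math.dist(p, q) for p in a for q in b)

def average_link(a, b):
    """Mean distance over all pairs with one point from each cluster."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerative(points, n_clusters, linkage=single_link):
    """Naive bottom-up clustering: start with singletons, merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (j > i, so deleting j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Passing `linkage=complete_link` or `linkage=average_link` changes which clusters count as "closest" and can therefore change the resulting hierarchy. The loop recomputes all pairwise cluster distances at every step, which is fine for a small illustration but too slow for large data sets.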
Hierarchical clustering - Top-down
Association/Clustering
Unsupervised: no class label.