What is clustering?
• Finding groups in data.
• Organizing data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.
Is clustering the same as classification?
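No. Classification is supervised (classes are known in advance and learned from labeled examples); clustering is unsupervised (no labels, the groups are discovered from the data).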
Distance measures
• Euclidean distance
-> straight-line distance between two data points
• Manhattan distance
-> taxicab distance -> sum of absolute differences along each coordinate
• Jaccard distance
-> treat two objects as sets of characteristics and compare their overlap (e.g. text mining: which words two documents share)
• Cosine distance
-> 1 minus the cosine of the angle between two vectors (often used in text mining/recommender systems)
• Edit distance
-> Levenshtein metric: minimum number of single-character insertions, deletions and substitutions -> autocorrect (spelling mistakes)
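The five measures above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for these notes, vectors are assumed to be equal-length numeric sequences, and the cosine distance is assumed to be undefined for zero vectors.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two equal-length numeric vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute differences along each coordinate.
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard(a, b):
    # 1 - |intersection| / |union|; two empty sets are treated as identical.
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(p, q):
    # 1 - cos(angle between p and q); assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

For example, euclidean((0, 0), (3, 4)) gives 5.0, manhattan of the same points gives 7, and edit_distance("kitten", "sitting") gives 3.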
K-means clustering - how to?
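The notes do not spell out the procedure here, so what follows is a sketch of the standard iteration (often called Lloyd's algorithm), assuming Euclidean distance and random initial centroids picked from the data. The function name kmeans and the parameters max_iter and seed are illustrative choices, not taken from the notes.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # points: list of equal-length tuples of numbers; k: number of clusters.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: initialise from k data points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster: converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments
```

Calling kmeans(points, 2) on two well-separated blobs returns two centroids and, for each point, the index of the cluster it ended up in. The result can depend on the initial centroids, so in practice the algorithm is often run several times with different seeds.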
Strengths k-means
- Efficient: the cost of each iteration grows linearly with the number of points
Weaknesses k-means
- Requires the number of clusters k to be specified in advance
Hierarchical clustering
Creates a nested collection of ways to group the points, from every point in its own cluster up to all points in a single cluster
Output of hierarchical clustering
Dendrograms (tree diagrams showing which clusters are merged and at what distance)
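To make the output concrete, here is a minimal sketch of bottom-up (agglomerative) clustering that records each merge, which is the information a dendrogram draws. It assumes single linkage (distance between clusters = distance between their closest members) and Euclidean distance; the notes do not say which linkage is meant, and the names single_linkage and cut are hypothetical.

```python
import math

def single_linkage(points):
    # Start with every point in its own cluster, identified by its index.
    clusters = {i: [i] for i in range(len(points))}
    merges = []  # one (cluster_a, cluster_b, distance, new_size) per merge
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the pair of clusters whose closest members are nearest.
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d, len(clusters[next_id])))
        next_id += 1
    return merges

def cut(merges, n_points, n_clusters):
    # Replay only the first merges to stop at the requested number of clusters.
    clusters = {i: [i] for i in range(n_points)}
    for step, (a, b, _d, _size) in enumerate(merges[: n_points - n_clusters]):
        clusters[n_points + step] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())
```

The merge list is built once, and cut can then be called with any number of clusters without re-running the clustering. The brute-force search over all cluster pairs is written for readability, not speed.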
Strengths hierarchical clustering
* Does not require the number of clusters to be specified in advance.
Weaknesses hierarchical clustering
* Computationally expensive: standard algorithms need at least quadratic time and memory in the number of points.